Executing Queries to Remove Duplicate Rows in PostgreSQL

By squashlabs, Last Updated: October 30, 2023

PostgreSQL provides several techniques to remove duplicate rows from a table. These techniques include using the DISTINCT keyword, GROUP BY and HAVING clauses, Common Table Expressions (CTEs), subqueries, and the EXCEPT operator. Let’s dive into each technique and understand how to apply them.

Using a Query to Remove Duplicate Rows in PostgreSQL

One of the simplest ways to remove duplicate rows in PostgreSQL is to select the distinct rows from the table and write them into a new table, which can then replace the original. Let’s see how this can be done with an example.

Code Snippet: Removing Duplicate Rows with DISTINCT

To remove duplicate rows from a table using the DISTINCT keyword, you can execute the following query:

CREATE TABLE new_table AS
SELECT DISTINCT * FROM original_table;

In this code snippet, we create a new table named “new_table” and select the distinct rows from the “original_table” using the DISTINCT keyword. The “*” denotes selecting all columns from the original table. Once the query is executed, the new table will contain only the distinct rows from the original table.
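As a concrete illustration, the sketch below applies the same idea to a small, hypothetical “customers” table (the table names, columns, and sample rows are assumptions made for this example) and then swaps the deduplicated copy into place. Keep in mind that CREATE TABLE ... AS copies only the data, so indexes, constraints, defaults, and privileges on the original table would need to be recreated.

```sql
-- Hypothetical sample table; names and data are illustrative only.
CREATE TABLE customers (
  name  VARCHAR(50),
  email VARCHAR(100)
);

INSERT INTO customers (name, email) VALUES
  ('Ann', 'ann@example.com'),
  ('Ann', 'ann@example.com'),   -- exact duplicate of the row above
  ('Bob', 'bob@example.com');

BEGIN;

-- Build a deduplicated copy: the two identical 'Ann' rows collapse into one.
CREATE TABLE customers_dedup AS
SELECT DISTINCT * FROM customers;

-- Replace the original table with the deduplicated copy.
DROP TABLE customers;
ALTER TABLE customers_dedup RENAME TO customers;

COMMIT;

-- The table now holds two rows: one for Ann and one for Bob.
SELECT * FROM customers;
```

Running the swap inside a transaction means other sessions never see the table missing, since DDL statements are transactional in PostgreSQL.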

Code Snippet: Removing Duplicate Rows with GROUP BY and HAVING

Another approach to removing duplicate rows in PostgreSQL is the GROUP BY clause. Grouping by every column you want to keep collapses each set of identical rows into a single row. Here’s an example:

CREATE TABLE new_table AS
SELECT column1, column2, ..., columnN
FROM original_table
GROUP BY column1, column2, ..., columnN;

In this code snippet, we create a new table named “new_table” and select the columns that we want to keep from the original table. Grouping by those columns returns one row per distinct combination, so the resulting table contains only the distinct rows. Be careful with the HAVING clause here: adding HAVING COUNT(*) = 1 would keep only the combinations that occur exactly once and discard every duplicated row entirely, including the copy you want to keep. HAVING COUNT(*) > 1 is the form to use when you want to list the duplicated combinations.

Code Snippet: Removing Duplicate Rows with Common Table Expressions (CTEs)

Common Table Expressions (CTEs) provide a useful way to remove duplicate rows in PostgreSQL. CTEs allow you to define temporary result sets that can be referenced within a query. Here’s an example of using CTEs to remove duplicate rows:

CREATE TABLE new_table AS
WITH cte AS (
  SELECT column1, column2, ..., columnN,
         ROW_NUMBER() OVER (PARTITION BY column1, column2, ..., columnN ORDER BY column1) AS rn
  FROM original_table
)
SELECT column1, column2, ..., columnN
FROM cte
WHERE rn = 1;

In this code snippet, the CREATE TABLE ... AS statement wraps the whole query, because PostgreSQL does not accept CREATE TABLE after a WITH clause. The CTE named “cte” selects the columns we want to keep and uses the ROW_NUMBER() window function to number the rows within each group of identical values, starting at 1. Selecting only the rows where rn = 1 keeps exactly one row per group. PARTITION BY may list fewer columns than are selected if duplicates should be judged on a subset of columns; ORDER BY then decides which row in each group is kept.
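When the table has a unique key, the same ROW_NUMBER() idea can also delete the extra copies in place instead of building a new table. The following is a minimal sketch under that assumption; the “users” table, its “id” key, and the choice of “email” as the column that defines a duplicate are all hypothetical.

```sql
-- Hypothetical table with a surrogate key; names and data are illustrative.
CREATE TABLE users (
  id    SERIAL PRIMARY KEY,
  email VARCHAR(100),
  name  VARCHAR(50)
);

INSERT INTO users (email, name) VALUES
  ('ann@example.com', 'Ann'),
  ('ann@example.com', 'Ann B.'),  -- same email as the first row
  ('bob@example.com', 'Bob');

-- Number the rows within each email, oldest id first,
-- then delete every row that is not the first of its group.
WITH ranked AS (
  SELECT id,
         ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn
  FROM users
)
DELETE FROM users
WHERE id IN (SELECT id FROM ranked WHERE rn > 1);

-- Remaining rows: id 1 (Ann) and id 3 (Bob).
SELECT id, email, name FROM users ORDER BY id;
```

Ordering by id keeps the earliest row of each group; changing the ORDER BY expression would keep a different one.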

Code Snippet: Removing Duplicate Rows with Subqueries

Subqueries can also be used to remove duplicate rows in PostgreSQL. A subquery is a query nested within another query. Here’s an example of using a subquery to remove duplicate rows:

CREATE TABLE new_table AS
SELECT t1.column1, t1.column2, ..., t1.columnN
FROM original_table t1
WHERE NOT EXISTS (
  SELECT 1
  FROM original_table t2
  WHERE t2.column1 = t1.column1
    AND t2.column2 = t1.column2
    ...
    AND t2.columnN = t1.columnN
    AND t2.ctid < t1.ctid
);

In this code snippet, we create a new table named “new_table” and select the columns we want to keep from the original table. The correlated subquery looks for another row with the same values and a smaller ctid, the PostgreSQL system column that identifies a row’s physical location. A row is kept only when no such row exists, so exactly one row from each group of duplicates survives. Filtering on groups with HAVING COUNT(*) = 1 instead would drop every duplicated row entirely rather than keeping one copy. Because the comparison uses =, rows that contain NULL in a compared column never match each other and are all kept.

Code Snippet: Removing Duplicate Rows with the EXCEPT Operator

The EXCEPT operator can also be used to remove duplicate rows in PostgreSQL. The EXCEPT operator returns the distinct rows from the left query that are not present in the right query. Here’s an example:

CREATE TABLE new_table AS
SELECT column1, column2, ..., columnN
FROM original_table
EXCEPT
SELECT column1, column2, ..., columnN
FROM original_table
WHERE condition;

In this code snippet, we create a new table named “new_table” and select the columns we want to keep from the original table. EXCEPT returns one copy of each row from the first query that is not returned by the second query, so the statement both removes duplicates and drops every row whose values match a row selected by “condition”. To deduplicate without dropping anything else, the second query must return no rows. EXCEPT ALL, by contrast, does not collapse duplicates.
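A small self-contained query can make the behavior of EXCEPT easier to see. The inline VALUES lists below are made-up sample data, not part of the original example.

```sql
-- Left side: four rows, with the value 1 appearing twice.
-- Right side: a single row with the value 3.
SELECT x FROM (VALUES (1), (1), (2), (3)) AS left_rows(x)
EXCEPT
SELECT x FROM (VALUES (3)) AS right_rows(x);
-- Returns two rows, 1 and 2 (in no guaranteed order):
-- the duplicate 1 is collapsed, and 3 is removed because the right side returns it.
```

With EXCEPT ALL in place of EXCEPT, the same query would return 1, 1, and 2, because duplicates on the left side are then preserved.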

Identifying Duplicate Rows in PostgreSQL

Before removing duplicate rows, it is often necessary to identify them first. PostgreSQL provides several techniques to identify duplicate rows, including the COUNT function combined with GROUP BY, self-joins, and window functions. Let’s explore each technique with examples.

Code Snippet: Identifying Duplicate Rows with COUNT and GROUP BY

To identify duplicate rows in PostgreSQL, you can combine the COUNT function with the GROUP BY and HAVING clauses. Here’s an example:

SELECT column1, column2, ..., columnN, COUNT(*) AS duplicate_count
FROM original_table
GROUP BY column1, column2, ..., columnN
HAVING COUNT(*) > 1;

In this code snippet, we group the original table by the columns we want to check for duplicates. COUNT(*) counts the rows in each group, and we alias the result as “duplicate_count”. Finally, the HAVING clause keeps only the groups that contain more than one row. The result has one row per duplicated combination of values, along with the number of times it occurs.
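The query can be tried on a small sample. The “orders_raw” table and its rows below are hypothetical and exist only to show what the output looks like.

```sql
-- Hypothetical sample data: one value stored three times, one twice, one once.
CREATE TABLE orders_raw (
  customer VARCHAR(50),
  product  VARCHAR(50)
);

INSERT INTO orders_raw (customer, product) VALUES
  ('Ann', 'Keyboard'),
  ('Ann', 'Keyboard'),
  ('Ann', 'Keyboard'),
  ('Bob', 'Mouse'),
  ('Cid', 'Monitor'),
  ('Cid', 'Monitor');

SELECT customer, product, COUNT(*) AS duplicate_count
FROM orders_raw
GROUP BY customer, product
HAVING COUNT(*) > 1
ORDER BY duplicate_count DESC;
-- Returns (Ann, Keyboard, 3) and (Cid, Monitor, 2); Bob's single row is not listed.
```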

Code Snippet: Identifying Duplicate Rows with Self-Joins

Self-joins can also be used to identify duplicate rows in PostgreSQL. A self-join is a join operation where a table is joined with itself. Here’s an example:

SELECT t1.column1, t1.column2, ..., t1.columnN
FROM original_table t1
JOIN original_table t2 ON t1.column1 = t2.column1
                       AND t1.column2 = t2.column2
                       ...
                       AND t1.columnN = t2.columnN
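WHERE t1.ctid <> t2.ctid;

In this code snippet, we join the original table to itself on the specified columns, so each row is paired with every row that has the same values. PostgreSQL has no rowid column, so the WHERE clause compares ctid, the system column that identifies a row’s physical location, to exclude each row’s match with itself; a primary key works equally well if the table has one. The result is the duplicated rows. A row appears once for each other copy, so a value stored three times yields six result rows.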

Code Snippet: Identifying Duplicate Rows with Window Functions

Window functions provide a useful way to identify duplicate rows in PostgreSQL. Window functions allow you to perform calculations across a set of rows that are related to the current row. Here’s an example:

SELECT column1, column2, ..., columnN,
       COUNT(*) OVER (PARTITION BY column1, column2, ..., columnN) AS duplicate_count
FROM original_table;

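In this code snippet, we select the columns we want to check for duplicates from the original table. The COUNT(*) window function counts, for every row, how many rows share its values in the columns listed in PARTITION BY. Unlike GROUP BY, the query returns every row of the table, each with its count; rows with a duplicate_count greater than 1 are the duplicates. To list only those, wrap the query in a subquery or CTE and filter on duplicate_count in the outer query, because a window function cannot be used directly in WHERE.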
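Because a window function cannot be referenced in the WHERE clause of the same query level, listing only the duplicated rows takes an outer query. A minimal sketch, assuming a hypothetical “contacts” table with an “id” key:

```sql
-- Hypothetical sample table; names and data are illustrative only.
CREATE TABLE contacts (
  id    SERIAL PRIMARY KEY,
  email VARCHAR(100)
);

INSERT INTO contacts (email) VALUES
  ('ann@example.com'),
  ('bob@example.com'),
  ('ann@example.com');

-- The inner query attaches a per-email count to every row;
-- the outer query keeps only rows whose email occurs more than once.
SELECT id, email, duplicate_count
FROM (
  SELECT id, email,
         COUNT(*) OVER (PARTITION BY email) AS duplicate_count
  FROM contacts
) AS counted
WHERE duplicate_count > 1
ORDER BY id;
-- Returns the two 'ann@example.com' rows (ids 1 and 3), each with duplicate_count = 2.
```

Unlike the GROUP BY form, this keeps the individual rows and their keys visible, which is convenient when deciding which copy to delete.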

Preventing Duplicate Rows in PostgreSQL

In addition to removing duplicate rows, it is crucial to prevent them from occurring in the first place. PostgreSQL provides various mechanisms to prevent duplicate rows, including UNIQUE constraints, CHECK constraints, triggers, and partial indexes. Let’s explore each mechanism with examples.

Code Snippet: Preventing Duplicate Rows with UNIQUE Constraints

UNIQUE constraints ensure that the values in a column or a group of columns are unique across the table. Here’s an example:

CREATE TABLE example (
  column1 INTEGER,
  column2 VARCHAR(50),
  column3 DATE,
  UNIQUE (column1, column2)
);

In this code snippet, we create a table named “example” with three columns. We then add a UNIQUE constraint on columns 1 and 2, ensuring that the combination of values in these columns is unique across the table. If an attempt is made to insert or update a row with duplicate values in these columns, an error will be thrown.
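To see the constraint in action, the sketch below uses a hypothetical “example_unique” table (named differently so it does not clash with the table above) and inserts the same key twice. The ON CONFLICT clause, available in PostgreSQL 9.5 and later, is one way to skip a duplicate silently instead of raising an error.

```sql
CREATE TABLE example_unique (
  column1 INTEGER,
  column2 VARCHAR(50),
  column3 DATE,
  UNIQUE (column1, column2)
);

INSERT INTO example_unique VALUES (1, 'a', '2023-10-30');

-- A plain insert of the same key would fail with a unique-violation error,
-- because (1, 'a') already exists:
-- INSERT INTO example_unique VALUES (1, 'a', '2023-10-31');

-- With ON CONFLICT ... DO NOTHING the duplicate is skipped instead.
INSERT INTO example_unique VALUES (1, 'a', '2023-10-31')
ON CONFLICT (column1, column2) DO NOTHING;

-- The table still contains a single row.
SELECT COUNT(*) FROM example_unique;
```

Note that, by default, NULL values are treated as distinct from one another, so rows with NULL in a constrained column are not considered duplicates by the constraint.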

Code Snippet: Preventing Duplicate Rows with CHECK Constraints

CHECK constraints allow you to define conditions that must be true for each row in a table. Here’s an example:

CREATE TABLE example (
  column1 INTEGER,
  column2 VARCHAR(50),
  column3 DATE,
  CHECK (column1 > 0)
);

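In this code snippet, we create a table named “example” with three columns. We then add a CHECK constraint on column 1, ensuring that the value in this column is greater than 0 for each row. If an attempt is made to insert or update a row with a value less than or equal to 0 in column 1, an error will be thrown. Note that a CHECK constraint is evaluated against each row on its own and cannot compare a row with others, so it does not detect duplicates by itself; it complements a UNIQUE constraint by rejecting invalid values before they are stored.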

Code Snippet: Preventing Duplicate Rows with Triggers

Triggers are database objects that are automatically executed when a specified event occurs, such as inserting, updating, or deleting a row. We can use triggers to prevent duplicate rows by checking the values before they are inserted or updated. Here’s an example:

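CREATE OR REPLACE FUNCTION prevent_duplicates()
RETURNS TRIGGER AS $$
BEGIN
  -- An UPDATE that leaves both key columns unchanged would otherwise
  -- match the row's own stored version, so skip the check in that case.
  IF TG_OP = 'UPDATE' THEN
    IF NEW.column1 IS NOT DISTINCT FROM OLD.column1
       AND NEW.column2 IS NOT DISTINCT FROM OLD.column2 THEN
      RETURN NEW;
    END IF;
  END IF;

  -- Reject the row if another row already holds the same key values.
  IF EXISTS (
    SELECT 1
    FROM example
    WHERE column1 = NEW.column1
      AND column2 = NEW.column2
  ) THEN
    RAISE EXCEPTION 'Duplicate rows are not allowed.';
  END IF;

  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER example_trigger
BEFORE INSERT OR UPDATE ON example
FOR EACH ROW
EXECUTE FUNCTION prevent_duplicates();

In this code snippet, we create a trigger function named “prevent_duplicates” that checks if a row with the same values in columns 1 and 2 already exists in the “example” table. The function first lets through any update that leaves both columns unchanged, since such a row would otherwise match its own stored version and be rejected. If a duplicate row is found, an exception is raised. We then create a trigger named “example_trigger” that executes the “prevent_duplicates” function before each insert or update operation on the “example” table. EXECUTE FUNCTION requires PostgreSQL 11 or later; earlier versions use EXECUTE PROCEDURE. Two concurrent transactions can both pass this check before either commits, so a UNIQUE constraint remains the more reliable safeguard.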

Code Snippet: Preventing Duplicate Rows with Partial Indexes

Partial indexes allow you to create indexes on a subset of rows in a table, based on a specified condition. Here’s an example:

CREATE UNIQUE INDEX example_index
ON example (column1)
WHERE column2 = 'value';

In this code snippet, we create a unique index named “example_index” on column 1 of the “example” table. The index covers only the rows where column 2 has the value ‘value’, so values in column 1 must be unique among those rows. Rows with any other value in column 2 are not indexed and may repeat values in column 1 freely.
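A typical use of this pattern is allowing many historical rows but only one “current” row per key. The sketch below assumes a hypothetical “subscriptions” table in which each user may have at most one active subscription; the table, column, and index names are illustrative.

```sql
CREATE TABLE subscriptions (
  user_id INTEGER,
  status  VARCHAR(20)
);

-- Uniqueness is enforced only among rows whose status is 'active'.
CREATE UNIQUE INDEX one_active_subscription
ON subscriptions (user_id)
WHERE status = 'active';

INSERT INTO subscriptions VALUES (1, 'cancelled');  -- ok
INSERT INTO subscriptions VALUES (1, 'cancelled');  -- ok: not covered by the index
INSERT INTO subscriptions VALUES (1, 'active');     -- ok: first active row for user 1

-- A second active row for user 1 would be rejected with a unique-violation error:
-- INSERT INTO subscriptions VALUES (1, 'active');
```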
