Unable to delete duplicate rows with PostgreSQL

Question

My query deletes the whole table instead of duplicate rows. Video as proof: https://streamable.com/3s843

create table customer_info (
    id INT,
    first_name VARCHAR(50),
    last_name VARCHAR(50),
    phone_number VARCHAR(50)
);
insert into customer_info (id, first_name, last_name, phone_number) values
(1, 'Kevin', 'Binley', '600-449-1059'),
(1, 'Kevin', 'Binley', '600-449-1059'),
(2, 'Skippy', 'Lam', '779-278-0889');

My query:

with t1 as (
select *, row_number() over(partition by id order by id) as rn
from customer_info)

delete
from customer_info 
where id in (select id from t1 where rn > 1);

What do you think the subquery `(select id from t1 where rn > 1)` returns? — wildplasser, Aug 18 '19 at 21:52
And: what do you expect `delete from customer_info where id in (1,2);` would do? — wildplasser, Aug 18 '19 at 22:38
I expect `where rn > 1` will return all rows besides the first row. Therefore all rows should be deleted besides the first row. — Kay, Aug 18 '19 at 23:01
The subquery does not return *rows*, it returns a set of `id`s. — wildplasser, Aug 18 '19 at 23:29

Erwin Brandstetter · Accepted Answer · 2019-08-19T00:56:14.500

Your query would delete all rows from each set of dupes (as all share the same id by which you select - that's what @wildplasser hinted at with subtle comments) and only initially unique rows would survive. So if it "deletes the whole table", that means there were no unique rows at all.
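To see this for yourself, here is a small sketch that runs the CTE from the question against the three sample rows and shows the row numbers it assigns. It assumes the table contains exactly the data posted in the question.

```sql
-- Inspect the row numbers the CTE assigns to the sample data
WITH t1 AS (
   SELECT *, row_number() OVER (PARTITION BY id ORDER BY id) AS rn
   FROM   customer_info
   )
SELECT id, rn
FROM   t1
ORDER  BY id, rn;

--  id | rn
-- ----+----
--   1 |  1
--   1 |  2
--   2 |  1
```

With this data, the subquery `select id from t1 where rn > 1` returns only the value 1, so the outer statement is effectively `DELETE FROM customer_info WHERE id IN (1)`, which removes both 'Kevin' rows and not just the extra copy.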

In your query, dupes are defined by (id) alone, not by the whole row as your title suggests.

Either way, there is a remarkably simple solution:

DELETE FROM customer_info c
WHERE  EXISTS (
   SELECT FROM customer_info c1
   WHERE  ctid < c.ctid
   AND    c1 = c  -- comparing whole rows
   );

Since you deal with completely identical rows, the remaining way to tell them apart is the internal tuple ID ctid.

My query deletes all rows, where an identical row with a smaller ctid exists. Hence, only the "first" row from each set of dupes survives.
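If you want to check first which rows would go, one option is to run the same condition as a SELECT. This is only a preview sketch with the same predicate as the DELETE above; the ctid values you see depend on the physical state of your table.

```sql
-- Preview: list the rows the DELETE above would remove, together with their ctid
SELECT c.ctid, c.*
FROM   customer_info c
WHERE  EXISTS (
   SELECT FROM customer_info c1
   WHERE  c1.ctid < c.ctid   -- an identical row stored "earlier"
   AND    c1 = c             -- comparing whole rows
   );
```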

Notably, NULL values compare equal in this case - which is most probably as desired. The manual:

The SQL specification requires row-wise comparison to return NULL if the result depends on comparing two NULL values or a NULL and a non-NULL. PostgreSQL does this only when comparing the results of two row constructors (as in Section 9.23.5) or comparing a row constructor to the output of a subquery (as in Section 9.22). In other contexts where two composite-type values are compared, two NULL field values are considered equal, [...]
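A rough way to observe the difference described in the quote, assuming a throwaway temp table (the name null_demo and its columns are made up for this illustration):

```sql
-- Two row constructors: per the quoted passage the result is NULL,
-- because it depends on comparing two NULL values
SELECT ROW(1, NULL::text) = ROW(1, NULL::text) AS row_constructors;

-- Two composite-type values (whole-row references to a hypothetical table)
CREATE TEMP TABLE null_demo (a int, b text);
INSERT INTO null_demo VALUES (1, NULL), (1, NULL);

-- Per the quoted passage, the NULL fields should be considered equal here
SELECT x = y AS whole_rows
FROM   null_demo x, null_demo y
LIMIT  1;
```

The second form is the one the DELETE above relies on with `c1 = c`.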

If dupes are defined by id alone (as your query suggests), then this would work:

DELETE FROM customer_info c
WHERE  EXISTS (
   SELECT FROM customer_info c1
   WHERE  ctid < c.ctid
   AND    id = c.id
   );

But then there might be a better way to decide which rows to keep than ctid as a measure of last resort!

Obviously, you would then add a PRIMARY KEY to avoid the initial dilemma from reappearing. For the second interpretation, id is the candidate.
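A minimal sketch for the second interpretation, to be run after the duplicates are gone. The constraint name is just an illustrative choice:

```sql
-- Fails if id still contains duplicate or NULL values
ALTER TABLE customer_info
   ADD CONSTRAINT customer_info_pkey PRIMARY KEY (id);
```

Once the constraint is in place, an INSERT that repeats an existing id is rejected with a unique violation.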

Related:

How do I (or can I) SELECT DISTINCT on multiple columns?

About ctid:

How do I decompose ctid into page and row numbers?

The Impaler · Answer 2 · 2019-08-19T01:48:49.230

You can't if the table does not have a key.

Tables have "keys" that identify each row uniquely. If your table does not have any key, then you won't be able to identify one row from the other one.

The only workaround to delete duplicate rows I can think of would be to:

1. Add a key on the table.
2. Use the key to delete the rows that are in excess.

For example:

create sequence seq1;
alter table customer_info add column k1 int;
update customer_info set k1 = nextval('seq1');

delete from customer_info where k1 in (
  select k1 
  from (
    select
      k1,
      row_number() over(partition by id, first_name, last_name, phone_number) as rn
    from customer_info
  ) x
  where rn > 1
);

Now you only have two rows.
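As a follow-up sketch, you could verify that no duplicates remain and then drop the helper objects again. This assumes k1 and seq1 were created only for this cleanup, as in the example above:

```sql
-- Should return no rows once the duplicates are gone
select id, first_name, last_name, phone_number, count(*)
from customer_info
group by id, first_name, last_name, phone_number
having count(*) > 1;

-- Optional cleanup of the helper column and sequence
alter table customer_info drop column k1;
drop sequence seq1;
```

If you would rather keep k1 as a permanent key, skip the cleanup and declare it as the primary key instead.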
