
I have a PostgreSQL table with a very large number of columns. The table does not have a primary key and now contains several rows that are 100% duplicates of another row.

How can I remove those duplicates without deleting the original along with them?

I found this answer on a related question, but I'd have to spell out each and every column name, which is error-prone. How can I avoid having to know anything about the table structure?

Example:

Given

create table duplicated (
 id int,
 name text,
 description text
);

insert into duplicated
values (1, 'A', null), 
       (2, 'B', null),
       (2, 'B', null),
       (3, 'C', null), 
       (3, 'C', null), 
       (3, 'C', 'not a DUPE!');

after deletion, the following rows should remain:

(1, 'A', null) 
(2, 'B', null)
(3, 'C', null) 
(3, 'C', 'not a DUPE!')
  • @a_horse_with_no_name: edited my question to explain the difference to the question you proposed as duplicate: one aspect I couldn't find there was how to avoid knowing the table structure. This rendered its answer inapplicable in my scenario. – blubb Oct 28 '19 at 16:00

1 Answer


As proposed in this answer, use the system column ctid to distinguish the physical copies of otherwise identical rows.

To avoid having to spell out a non-existent 'key' for the rows, simply use the row constructor row(table), which returns a composite value containing the entire row, as returned by select * from table:

DELETE FROM duplicated
USING (
    SELECT min(ctid) AS ctid, row(duplicated) AS row
    FROM duplicated
    GROUP BY row(duplicated)
    HAVING count(*) > 1
) uniqued
WHERE row(duplicated) = uniqued.row
  AND duplicated.ctid <> uniqued.ctid;
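If you'd like to see which physical rows the statement would remove before actually deleting anything, the same logic can be run as a plain SELECT first. This preview query is a sketch built from the DELETE above, not part of the original answer:

```sql
-- Preview: the ctids of the extra copies the DELETE above would remove.
SELECT duplicated.ctid, row(duplicated) AS full_row
FROM duplicated
JOIN (
    SELECT min(ctid) AS ctid, row(duplicated) AS row
    FROM duplicated
    GROUP BY row(duplicated)
    HAVING count(*) > 1
) uniqued
  ON row(duplicated) = uniqued.row
 AND duplicated.ctid <> uniqued.ctid;
```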

You can try it in this DbFiddle.
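To confirm the deletion worked without knowing the table structure either, the same row() trick gives a structure-agnostic check (a sketch, assuming the example table above):

```sql
-- Sanity check after deletion: should return zero rows,
-- since no full row should occur more than once anymore.
SELECT row(duplicated) AS full_row, count(*) AS copies
FROM duplicated
GROUP BY row(duplicated)
HAVING count(*) > 1;
```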
