1

I have table text_table in PostgreSQL 12 database with only 2 columns: text1 and text2.
There is an idx_text1 index on the text1 column.

For example:

       text1    text2
----------------------
   1   str1     one
   2   str2     two
   3   str3     three
   4   str2     two
   5   str1     seven
   6   str4     seven
   7   str3     three

How do I remove all duplicate rows (with exact match of values in both columns)?

In the example above, rows 4 and 7 (or 2 and 3) need to be deleted from the table.
I would like to find the fastest way with an SQL query.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228

2 Answers2

1

Without a primary key, such task is tedious. I think the simplest approach may be to backup/restore, like so:

create table tmptable as select distinct text1, text2 from mytable;
truncate table mytable;  -- back it up first!!
insert into mytable select * from tmptable;
drop table tmptable;
GMB
  • 216,147
  • 25
  • 84
  • 135
  • If there is a Primary key in table, how will script look in this case? – Владимир Oct 12 '20 at 16:31
  • @Владимир: you said there was no primary key... so this is the approach I recommend. Even if you have one, this is likely to be the most efficient approach if you are going to delete a significant part of the table rows.If there are just a few duplicates, then you might want to check Erwin Brandstetter's answer. – GMB Oct 12 '20 at 20:56
1

The best performance depends on the size of the table, concurrent activity, the number and avg. size of rows, and most importantly the number of rows to delete.

For only few duplicates, no NULL values (or you don't consider those duplicates), and no PRIMARY KEY, this would do just fine:

DELETE FROM text_table t
WHERE  EXISTS (
   SELECT FROM  text_table 
   WHERE (text1, text2) = (t.text1, t.text2)
   AND   ctid < t.ctid  -- exclude self
-- AND   pk < t.pk      -- alternative with PK
   );

Among duplicates, this keeps the row with the smallest ctid (the first one, physically). (Or the one with the smallest PK value with the alt. syntax.)

About ctid:

Your index on (text1) should help. An index on (text1, text2) would typically help some more, unless strings are very long. (OTOH, when deleting a substantial percentage of all rows, indexes may generate more cost than benefit.)

And consider adding a UNIQUE index or constraint after removal of duplicates to defend against re-introducing more of those.

Related:

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228