Remove Duplicate rows from a large table - PostgreSQL

Question

I want to remove duplicates from a large table having about 1million rows and increasing every hour. It has no unique id and has about ~575 columns but sparsely filled.

The table is 'like' a log table where new entries are appended every hour without unique timestamp.

The duplicates are like 1-3% but I want to remove it anyway ;) Any ideas?

I tried ctid column (as here) but its very slow.

Daniel Vérité · Answer 1 · 2019-10-18T11:45:45.823

4

The basic idea that works generally well with PostgreSQL is to create an index on the hash of the set of columns as a whole.

Example:

CREATE INDEX index_name ON tablename (md5((tablename.*)::text));

This will work unless there are columns that don't play well with the requirement of immutability (mostly timestamp with time zone because their cast-to-text value is session-dependent).

Once this index is created, duplicates can be found quickly by self-joining with the hash, with a query looking like this:

SELECT t1.ctid, t2.ctid
FROM tablename t1 JOIN tablename t2
 ON (md5((t1.*)::text) = md5((t2.*)::text))
WHERE t1.ctid > t2.ctid;

You may also use this index to avoid duplicates rows in the future rather than periodically de-duplicating them, by making it UNIQUE (duplicate rows would be rejected at INSERT or UPDATE time).

edited Oct 18 '19 at 11:45

answered May 12 '14 at 17:51

Daniel Vérité

58,074
15
129
156

Didn't work for me, since I have timestamps with TZ, but I like the approach. – Mikhail Lisakov Oct 18 '19 at 11:09
I'm trying to do something similar but with a timestamptz. However I know for a fact that the database is UTC and (obviously) so are the timestamps; I'm looking for a workaround to immutability. I want to nuke a few thousand duplicate rows out of millions of rows so I can recreate the primary key... – Jeff Jun 29 '20 at 11:59
2

@Jeff: if you have a candidate primary key then you don't need the above method. Create an index on it, then eliminate the duplicates with a self-join that presumably will use that index, then drop the index, then set the unique constraint. – Daniel Vérité Jun 29 '20 at 12:11
Thanks for the lead! I will try a few things, otherwise I'll create a new question here and let you know. – Jeff Jun 29 '20 at 12:15
@DanielVérité Correct me if im wrong but in the example you've given here the index is created on all columns right? how can I do the same thing but create the md5 index only on specific columns? – Omri Shneor Aug 02 '21 at 08:38
1

@OmriShneor: this answer appears to be obsolete and if you need to deduplicate only on some columns, that's not the same question anyway. Please submit a new question. – Daniel Vérité Aug 02 '21 at 10:04
@DanielVérité why is the answer obsolete? – Omri Shneor Aug 02 '21 at 11:38
@OmriShneor: because PG says that `tablename.*::text` is not immutable no matter what the structure and refuses to create the index on `md5(tablename.*::text)`. This needs to be worked around. – Daniel Vérité Aug 04 '21 at 07:22

Remove Duplicate rows from a large table - PostgreSQL

1 Answers1