2

I want to remove duplicates from a large table having about 1million rows and increasing every hour. It has no unique id and has about ~575 columns but sparsely filled.

The table is 'like' a log table where new entries are appended every hour without unique timestamp.

The duplicates are like 1-3% but I want to remove it anyway ;) Any ideas?

I tried ctid column (as here) but its very slow.

Community
  • 1
  • 1
UserBSS1
  • 2,091
  • 1
  • 28
  • 31

1 Answers1

4

The basic idea that works generally well with PostgreSQL is to create an index on the hash of the set of columns as a whole.

Example:

CREATE INDEX index_name ON tablename (md5((tablename.*)::text));

This will work unless there are columns that don't play well with the requirement of immutability (mostly timestamp with time zone because their cast-to-text value is session-dependent).

Once this index is created, duplicates can be found quickly by self-joining with the hash, with a query looking like this:

SELECT t1.ctid, t2.ctid
FROM tablename t1 JOIN tablename t2
 ON (md5((t1.*)::text) = md5((t2.*)::text))
WHERE t1.ctid > t2.ctid;

You may also use this index to avoid duplicates rows in the future rather than periodically de-duplicating them, by making it UNIQUE (duplicate rows would be rejected at INSERT or UPDATE time).

Daniel Vérité
  • 58,074
  • 15
  • 129
  • 156
  • Didn't work for me, since I have timestamps with TZ, but I like the approach. – Mikhail Lisakov Oct 18 '19 at 11:09
  • I'm trying to do something similar but with a timestamptz. However I know for a fact that the database is UTC and (obviously) so are the timestamps; I'm looking for a workaround to immutability. I want to nuke a few thousand duplicate rows out of millions of rows so I can recreate the primary key... – Jeff Jun 29 '20 at 11:59
  • 2
    @Jeff: if you have a candidate primary key then you don't need the above method. Create an index on it, then eliminate the duplicates with a self-join that presumably will use that index, then drop the index, then set the unique constraint. – Daniel Vérité Jun 29 '20 at 12:11
  • Thanks for the lead! I will try a few things, otherwise I'll create a new question here and let you know. – Jeff Jun 29 '20 at 12:15
  • @DanielVérité Correct me if im wrong but in the example you've given here the index is created on all columns right? how can I do the same thing but create the md5 index only on specific columns? – Omri Shneor Aug 02 '21 at 08:38
  • 1
    @OmriShneor: this answer appears to be obsolete and if you need to deduplicate only on some columns, that's not the same question anyway. Please submit a new question. – Daniel Vérité Aug 02 '21 at 10:04
  • @DanielVérité why is the answer obsolete? – Omri Shneor Aug 02 '21 at 11:38
  • @OmriShneor: because PG says that `tablename.*::text` is not immutable no matter what the structure and refuses to create the index on `md5(tablename.*::text)`. This needs to be worked around. – Daniel Vérité Aug 04 '21 at 07:22