
Like here, I have a large table which stores all events in our systems. For one event type I have duplicate rows (mistakenly exported from another system several times), and I need to delete them to clean up our stats. The solution proposed above was to

  • insert the records -- without duplicates -- into a temporary table,
  • truncate the original table and insert them back in.

But in my situation I need to delete only one class of events, not all rows, which is impossible with TRUNCATE. I'm wondering whether or not I can benefit from Postgres' USING syntax like in this SO answer, which offers the following solution:

DELETE FROM user_accounts 
USING user_accounts ua2   
WHERE user_accounts.email = ua2.email AND user_accounts.id < ua2.id;

The problem is that I don't have an id field in this large table. So what would be the fastest approach in this situation? Is DELETE + INSERT from a temporary table the only option?

1 Answer

You could use the ctid column as a "replacement id":

DELETE FROM user_accounts 
USING user_accounts ua2   
WHERE user_accounts.email = ua2.email 
  AND user_accounts.ctid < ua2.ctid;

Although that raises another question: why doesn't your user_accounts table have a primary key?

But if you delete a substantial part of the rows in the table, DELETE will never be very efficient (and the comparison on ctid isn't a quick one either, because ctid cannot be indexed). So the delete will most probably take a very long time.
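In your case, since only one event class contains the duplicates, the ctid comparison can at least be restricted to that class so the self-join doesn't scan everything. A sketch, assuming a hypothetical events table with event_type, payload and created_at columns standing in for your real schema:

```sql
-- hypothetical schema: events(event_type, payload, created_at)
DELETE FROM events e1
USING events e2
WHERE e1.event_type = 'imported_event'   -- only the duplicated class
  AND e2.event_type = 'imported_event'
  AND e1.payload    = e2.payload         -- the columns that define a duplicate
  AND e1.created_at = e2.created_at
  AND e1.ctid < e2.ctid;                 -- keep one row per duplicate group
```

An index on event_type would help find the candidate rows, but the ctid comparison itself still can't use an index.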

For a one time operation and if you need to delete many rows, then inserting those you want to keep into an intermediate table is going to be much faster.

That method can be improved by simply keeping the intermediate table instead of copying the rows back to the original table.

-- run everything in a single transaction so other sessions
-- only see the change once it is committed
begin;

-- this creates the same table including indexes, defaults and
-- not null constraints -- but NOT foreign key constraints!
create table temp (like user_accounts including all);

insert into temp 
select distinct ... -- this is your query that removes the duplicates
from user_accounts;

-- you might need cascade if the table is referenced by others
drop table user_accounts;

alter table temp rename to user_accounts;

commit;

The only drawback is that you have to re-create the foreign keys of the original table: both those in other tables referencing it, and those from it to other tables.
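A minimal sketch of that re-creation step, assuming a hypothetical orders table that referenced the old user_accounts (the constraint and column names are made up; the referenced column must carry a unique constraint, which `including all` will have copied over):

```sql
-- hypothetical: restore a foreign key from a referencing table
alter table orders
  add constraint orders_user_email_fkey
  foreign key (user_email) references user_accounts (email);
```

Running this inside the same transaction as the drop/rename keeps the window without constraints invisible to other sessions.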

  • the example I used was not my table but the code from the SO answer. I don't have an id field, so I will use your solution with an intermediate table. But I can't actually drop the table, so I need to delete the duplicate rows from it, which is very slow... I hoped for some miracle solution :) – RInatM Apr 02 '14 at 07:01
  • @RInatM: why can't you drop the table? Is it constantly being accessed? Are you aware that the create/drop/insert/rename can be done in a single *transaction* in Postgres, so other sessions won't see the change until it is committed? –  Apr 02 '14 at 07:05
  • Maybe worth linking the docs, which describe that *every* table has a `ctid` column: http://www.postgresql.org/docs/9.3/static/ddl-system-columns.html – pozs Apr 02 '14 at 08:44
  • @pozs thanks, I didn't know about that. Yes, it's constantly being accessed and lots of views depend on it. And it's huge - I need to delete about 5M rows for one type of events, but together with the others this table contains more than 100M rows. – RInatM Apr 02 '14 at 10:43