
I have a PostgreSQL database that is 100 GB in size. One of the tables has about half a billion entries. For quick data entry, some of the data was repeated and left to be pruned later. One of the columns can be used to identify the rows as unique.

I found this Stack Overflow question, which suggests a solution for MySQL:

ALTER IGNORE TABLE table_name ADD UNIQUE (location_id, datetime)

Is there anything similar for PostgreSQL?

I tried deleting with GROUP BY and with ROW_NUMBER(); in both cases my computer ran out of memory after a few hours.
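
The ROW_NUMBER() version was along these lines (the (location_id, datetime) key is only illustrative, borrowed from the example above; the GROUP BY version was similar):

-- Keep the first physical row in each duplicate group, delete the rest.
DELETE FROM orders
WHERE ctid IN (
    SELECT ctid
    FROM (
        SELECT ctid,
               ROW_NUMBER() OVER (PARTITION BY location_id, datetime
                                  ORDER BY ctid) AS rn
        FROM orders
    ) numbered
    WHERE rn > 1
);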

This is what I get when I try to estimate the number of rows in the table:

SELECT reltuples FROM pg_class WHERE relname = 'orders';
  reltuples  
-------------
 4.38543e+08
(1 row)
nurettin
  • Can you show your table DDL and the delete statement you are executing? If you turn your delete into a select count(*) how many rows is your delete targeting? – Kuberchaun Jul 26 '13 at 01:54
  • @JustBob select count(1) took more than an hour before I stopped it. – nurettin Jul 26 '13 at 05:39
  • My idea was to do your delete in batches. I was curious whether your delete was going after 2 rows or 300 million. If you have a nice index column that's a sequence, you can sweep through sets of rows and delete from there. You would have to write a Python script or something else that loops: delete from the table where id is between min and max, commit, then shift min and max to the next range, until you hit the end of the table, only going after, say, 50,000 records at a time. Make sense? (A sketch of this appears after these comments.) – Kuberchaun Jul 26 '13 at 06:11
  • @JustBob yes if it comes down to that, I will have to delete duplicates programmatically with a cursor. Trying to understand bma's suggestion right now. I think your idea and his extra column can be used together to delete duplicates. – nurettin Jul 26 '13 at 06:16
  • @nurettin My suggestion #1 was predicated on the assumption that you are deleting a good portion of your 100GB table and didn't want to have the bloat associated with a massive delete (or series of deletes). If you only have a few million rows to delete, then do it in batches and run VACUUM after each run. – bma Jul 26 '13 at 14:41
  • What did you end up doing? – Zia Ul Rehman Mughal Jan 29 '19 at 15:52
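
A minimal sketch of the batch-delete loop Kuberchaun describes above (the id column, the (location_id, datetime) duplicate key, and the 50,000-row window are all illustrative):

-- One pass of the batched delete: remove duplicates only within the current
-- id window, then shift the window and re-run from a driver script.
DELETE FROM orders o
USING orders d
WHERE o.id BETWEEN 1 AND 50000         -- current batch window
  AND d.location_id = o.location_id    -- d is another copy of the same row
  AND d.datetime = o.datetime
  AND d.id < o.id;                     -- keep the lowest id in each duplicate group

The driver script commits after each batch and, per bma's comment, runs VACUUM between batches to keep bloat down.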

1 Answer


Two solutions immediately come to mind:

1). Create a new table as a SELECT * from the source table, with a WHERE clause to determine the unique rows. Add indexes to match the source table, then rename both tables in a transaction. Whether or not this will work for you depends on several factors: the amount of free disk space, whether the table is in constant use and interruptions to access are permissible, and so on. Creating a new table has the benefit of tightly packing your data and indexes, and the table will be smaller than the original because the non-unique rows are omitted.
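
A minimal sketch of this approach, borrowing the (location_id, datetime) key from the question's MySQL example (DISTINCT ON is one way to express "only the unique rows"):

CREATE TABLE orders_dedup AS
SELECT DISTINCT ON (location_id, datetime) *
FROM orders
ORDER BY location_id, datetime;   -- which duplicate survives is arbitrary; add ORDER BY columns to control it

-- Recreate the source table's indexes, e.g.:
CREATE UNIQUE INDEX orders_dedup_key_uidx ON orders_dedup (location_id, datetime);

-- Swap the tables in a single transaction:
BEGIN;
ALTER TABLE orders RENAME TO orders_old;
ALTER TABLE orders_dedup RENAME TO orders;
COMMIT;

Once you have verified the new table, orders_old can be dropped to reclaim the space.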

2). Create a partial unique index over your columns and add a WHERE clause to filter out the non-uniques. For example:

test=# create table t ( col1 int, col2 int, is_unique boolean);
CREATE TABLE

test=# insert into t values (1,2,true), (2,3,true),(2,3,false);
INSERT 0 3

test=# create unique index concurrently t_col1_col2_uidx on t (col1, col2) where is_unique is true;
CREATE INDEX

test=# \d t
        Table "public.t"
  Column   |  Type   | Modifiers 
-----------+---------+-----------
 col1      | integer | 
 col2      | integer | 
 is_unique | boolean | 
Indexes:
    "t_col1_col2_uidx" UNIQUE, btree (col1, col2) WHERE is_unique IS TRUE
bma