Deleting completely identical duplicates from db

Question

We have a table in our db where copied data has completely duplicated many rows. Because the id is also duplicated, there is nothing we can use to select just the duplicates. I tried using a limit to delete only one of them, but Redshift gave a syntax error on the limit.

Any ideas how we can delete just one of two rows that have completely identical information?

Allan Wind · Accepted Answer · 2021-03-03T10:57:17.227

1

Use select distinct to create a new table. Then either truncate the original and copy the data back, or drop the original table and rename the new table to the original name:

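-- copy one instance of every row into a scratch table
create table t2 as select distinct * from t;
-- empty the original (note: truncate commits immediately in Redshift)
truncate t;
-- copy the de-duplicated rows back
insert into t select * from t2;
drop table t2;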
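
The drop-and-rename variant could look roughly like the sketch below. It assumes the table is called t as above; t_new is a hypothetical name. It uses create table (like t) instead of create table as, on the assumption that you want the copy to keep the original distribution and sort keys.

```sql
-- empty copy of t with the same column definitions
create table t_new (like t);

-- keep exactly one instance of every fully identical row
insert into t_new select distinct * from t;

-- swap the de-duplicated copy in under the original name
drop table t;
alter table t_new rename to t;
```

Compare the row counts of t and t_new before the drop, since the drop cannot be undone once it is committed.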

Alternatively, add a column with unique values so that otherwise identical rows can be told apart. identity(seed, step) looks interesting.
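
A rough sketch of the unique-column idea. As far as I know, Redshift does not let you add an identity column with alter table add column, so this builds a new table instead. The names t_keyed, rid, and payload are hypothetical, and payload stands in for all of the remaining columns.

```sql
-- new table with a generated key in front of the original columns
create table t_keyed (
    rid     bigint identity(1, 1),  -- hypothetical surrogate key
    id      bigint,
    payload varchar(256)            -- stand-in for the other columns
);

-- rid is filled in automatically; duplicates now differ in rid only
insert into t_keyed (id, payload)
select id, payload from t;

-- keep the lowest rid of each group of identical rows, delete the rest
delete from t_keyed
where rid not in (
    select min(rid) from t_keyed group by id, payload
);
```

The identity values are unique but not guaranteed to be consecutive, which is fine here because they are only used to tell the copies apart. This still copies the whole table once, so it is not cheaper than select distinct.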

edited Mar 03 '21 at 10:57

answered Mar 03 '21 at 10:22


Thanks, yeah this is one option. I'm a bit worried since the table is 532 million rows though – Tyler Mar 03 '21 at 10:29
Yeah, I figured, added a 2nd option – Allan Wind Mar 03 '21 at 10:34
what would that second option query look like? I've never used row_number(). Does it work with redshift? – Tyler Mar 03 '21 at 10:34
or just get some arbitrary incrementing number using whatever language I'm using? – Tyler Mar 03 '21 at 10:35
Yeah, you can do it in app code. See https://docs.aws.amazon.com/redshift/latest/dg/r_WF_ROW_NUMBER.html – Allan Wind Mar 03 '21 at 10:35
I think I will do a variation of your first suggestion. I will use my app code, load each duplicate then delete all of them by id and reinsert just one row. Thanks! – Tyler Mar 03 '21 at 10:39
https://stackoverflow.com/questions/37582261/deleting-duplicates-rows-from-redshift – Allan Wind Mar 03 '21 at 10:47
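
The plan from the comments (pull out the duplicated rows, delete every copy by id, reinsert a single copy) can also be done in SQL alone. This is a sketch under the assumption that id identifies a group of duplicates; dups is a hypothetical temp table name. It only rewrites the duplicated rows, which may matter on a 532 million row table.

```sql
-- one copy of every row whose id occurs more than once
create temp table dups as
select distinct *
from t
where id in (
    select id from t group by id having count(*) > 1
);

begin;

-- remove every copy of the duplicated rows
delete from t where id in (select id from dups);

-- put a single copy of each back
insert into t select * from dups;

commit;

drop table dups;
```

If two rows share an id but differ in another column, both survive, because select distinct compares entire rows.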
