12

I have quite a large table with 19 000 000 records, and I have a problem with duplicate rows. There are a lot of similar questions, even here on SO, but none of them seems to give me a satisfactory answer. Some points to consider:

  • Row uniqueness is determined by two columns, location_id and datetime.
  • I'd like to keep the execution time as short as possible (< 1 hour).
  • Copying the table is not really feasible, as it is several gigabytes in size.
  • No need to worry about relations.

As said, each (location_id, datetime) combination can occur only once, and I would like to remove all the duplicate instances. It does not matter which one of them survives, as the data is identical.
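
For concreteness, a minimal sketch of the kind of table involved (location_id and datetime are from my description above; the table name, the extra column, and the types are just for illustration):

CREATE TABLE `table` (
  location_id INT NOT NULL,      -- part of the uniqueness rule
  `datetime` DATETIME NOT NULL,  -- part of the uniqueness rule
  value DECIMAL(8,2)             -- hypothetical payload column
);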

Any ideas?

Tatu Ulmanen
  • Consider temporarily removing indexes and triggers, if any exist. – Pentium10 Mar 05 '10 at 10:17
  • What was wrong with the approach given in http://stackoverflow.com/questions/1585412/sql-to-delete-duplicate-records-in-a-table ? – Mike Mar 05 '10 at 10:17
  • @Pentium10, good point, could speed things up, but could also kill performance if I go through with the subquery solution. – Tatu Ulmanen Mar 05 '10 at 10:25
  • Try generating the list of ids to delete first, without issuing a delete, while you still have the indexes. After you remove the indexes, run the longer plain-text SQL containing that bunch of ids (see the sketch below). – Pentium10 Mar 05 '10 at 10:29
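
A rough sketch of the two-phase idea from the comment above, using a scratch table instead of a literal id list (it assumes the table has a surrogate key id, which the question does not mention):

-- Phase 1: with indexes still in place, collect the ids of the extra
-- copies, keeping the lowest id per (location_id, datetime) group.
CREATE TABLE ids_to_delete AS
  SELECT t.id
  FROM `table` AS t
  JOIN `table` AS keep
    ON keep.location_id = t.location_id
   AND keep.datetime = t.datetime
   AND keep.id < t.id;

-- Phase 2: drop the secondary indexes and triggers, then run the delete.
DELETE t FROM `table` AS t
JOIN ids_to_delete AS d ON d.id = t.id;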

5 Answers

16

I think you can use this query to delete the duplicate records from the table:

ALTER IGNORE TABLE table_name ADD UNIQUE (location_id, datetime)

Before doing this, just test it with some sample data first, and then try it on the real table.

Note: On version 5.5, it works on MyISAM but not InnoDB.

shadyabhi
Vinodkumar SC
  • This looks promising, I hadn't heard about this feature before. Trying it now, I'll let you know how it turns out. And welcome to SO :) – Tatu Ulmanen Mar 05 '10 at 11:19
  • This worked, thank you. Took 31 minutes to go through 16 982 040 rows with 1 589 908 duplicates. I can't believe it could be this simple, with no additional tables or complex queries. :) – Tatu Ulmanen Mar 05 '10 at 12:10
  • @Vinodkumar Saravana, I'm running 5.5 with InnoDB; I read your note, but I tried it anyway just to be sure (of course, it did not work). Can you please explain why it does not work on InnoDB? – tixastronauta Apr 04 '13 at 08:21
  • @tixastronauta - It seems there is a bug in the InnoDB version of MySQL. Still, you can do it by converting the table from InnoDB to MyISAM, applying the ALTER IGNORE query, and then converting back to InnoDB. Take a backup before converting. Alternatively, you can use set session old_alter_table=1; (sketched below). Ref: http://dev.mysql.com/doc/refman/5.1/en/server-options.html – Vinodkumar SC Apr 17 '13 at 12:05
  • 'ALTER IGNORE TABLE' doesn't work from MySQL 5.7 onwards (the IGNORE clause was removed in 5.7, regardless of engine). You may convert your table to MyISAM first, remove the duplicates, and then convert it back to InnoDB using: https://stackoverflow.com/a/23421788 – Mohsen Abasi Jul 12 '17 at 11:35
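
For reference, a sketch of the two InnoDB workarounds mentioned in the comments above (the table name is a placeholder, and these apply to MySQL 5.6 and earlier only, since ALTER IGNORE itself is gone in 5.7; take a backup first, as advised):

-- Option 1: force the old copy-based ALTER, which honours IGNORE.
SET SESSION old_alter_table = 1;
ALTER IGNORE TABLE table_name ADD UNIQUE (location_id, datetime);

-- Option 2: round-trip through MyISAM.
ALTER TABLE table_name ENGINE = MyISAM;
ALTER IGNORE TABLE table_name ADD UNIQUE (location_id, datetime);
ALTER TABLE table_name ENGINE = InnoDB;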
1
SELECT *, COUNT(*) AS Count
FROM `table`
GROUP BY location_id, datetime
HAVING Count > 1
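
This only lists the duplicate groups. A hedged sketch of one way to act on the result, assuming the table has a surrogate primary key id (nothing in the question says it does):

-- Keep one survivor id per (location_id, datetime) group.
CREATE TEMPORARY TABLE keepers AS
  SELECT MIN(id) AS id
  FROM `table`
  GROUP BY location_id, datetime;

-- Delete every row that is not a survivor.
DELETE FROM `table`
WHERE id NOT IN (SELECT id FROM keepers);
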
Patrick
Sjoerd
0

This query works perfectly in every case; tested on the MyISAM engine with 2 million rows.

ALTER IGNORE TABLE table_name ADD UNIQUE (location_id, datetime)

Keith
0
UPDATE `table` AS t
JOIN `table` AS tableBis
  ON tableBis.location_id = t.location_id
 AND t.datetime > tableBis.datetime
SET t.datetime = NULL;

CREATE TABLE tableCopyWithNoDuplicate AS
SELECT * FROM `table` WHERE datetime IS NOT NULL;

DROP TABLE `table`;

RENAME TABLE tableCopyWithNoDuplicate TO `table`;

So you keep the row with the lowest datetime for each location_id (note that the datetime column must be nullable for the UPDATE step to work). I'm not sure about performance; it depends on your table columns, your server, etc...

remi bourgarel
0

You can delete duplicates using these steps:

1- Export the following query's results into a txt file:

select dup_col from table1 group by dup_col having count(dup_col) > 1

2- Prepend the following to the contents of that txt file and run the resulting query:

delete from table1 where dup_col in (.....)

Note that '.....' stands for the contents of the txt file created in the first step. Also note that, as written, this deletes every row whose dup_col value is duplicated, including the copy you might want to keep; see the sketch below for a variant that keeps one row per group.
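
A hedged variant that keeps one row per duplicate group instead (it assumes a surrogate key id on table1, which the steps above do not use):

-- Delete each row that has an older copy (lower id) with the same
-- dup_col value, so the lowest id in every group survives.
DELETE t
FROM table1 AS t
JOIN table1 AS keep
  ON keep.dup_col = t.dup_col
 AND keep.id < t.id;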

Mohsen Abasi