
Issue: hundreds of tables with identical schemas. Some of them contain duplicated data that needs to be removed. My usual strategy for this is (sketched below):

For each table in the list:
  • create a temp table with a unique key on all fields
  • insert ignore select * from the old table
  • truncate the original table
  • insert select * back into the original table
  • drop or clean the temp table
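A rough sketch of those steps in SQL, using a hypothetical table my_table with columns col_a, col_b, col_c standing in for the real schema (and assuming the combined columns fit within MySQL's index-length limit):

```sql
-- Placeholder names: my_table / my_table_dedup and col_a, col_b, col_c.
CREATE TEMPORARY TABLE my_table_dedup LIKE my_table;
ALTER TABLE my_table_dedup ADD UNIQUE KEY uniq_all (col_a, col_b, col_c);

-- Duplicates collide with the unique key and are silently skipped.
INSERT IGNORE INTO my_table_dedup SELECT * FROM my_table;

-- Empty the original and copy the de-duplicated rows back.
TRUNCATE TABLE my_table;
INSERT INTO my_table SELECT * FROM my_table_dedup;

DROP TEMPORARY TABLE my_table_dedup;
```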

For smaller tables this works fine. Unfortunately, the tables I'm cleaning often have hundreds of millions of records, so my jobs and client connections time out while this runs. (Since there are hundreds of these tables, I'm using Perl to walk the list and clean each one; that's where the timeout happens.)

Some options I'm looking into:

mysqldump - fast, but I don't see how to do the subsequent 'insert ignore' step.

into outfile / load infile - also fast, but I'm running from a remote host and 'into outfile' creates all the files on the MySQL server, which is hard to clean up.
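This is roughly what the outfile / load infile variant would look like in place of the INSERT IGNORE ... SELECT step above (same placeholder names; the path is hypothetical and, as noted, the file ends up on the MySQL server's filesystem):

```sql
-- The file is written by the server process, on the server host.
SELECT * FROM my_table INTO OUTFILE '/tmp/my_table.tsv';

-- IGNORE drops rows that collide with the unique key on my_table_dedup,
-- so the duplicates are removed on the way back in.
LOAD DATA INFILE '/tmp/my_table.tsv' IGNORE INTO TABLE my_table_dedup;
```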

do the insert/select in blocks of 100K records - this avoids the DB timeout, but it's pretty slow.
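One pass of the block-wise variant might look like the statement below (placeholder names again); the offset advances by 100K per pass, with the loop living in the Perl driver:

```sql
-- The ORDER BY makes the paging deterministic, but re-scanning and
-- sorting the table for every chunk is a big part of why this is slow.
INSERT IGNORE INTO my_table_dedup
SELECT * FROM my_table
ORDER BY col_a, col_b, col_c
LIMIT 100000 OFFSET 0;   -- next pass: OFFSET 100000, then 200000, ...
```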

I'm sure there is a better way. Suggestions?

ethrbunny
  • [Probably this should help.](http://stackoverflow.com/questions/2385921/deleting-duplicates-from-a-large-table) – tuxuday Apr 19 '12 at 13:48
  • I'll try the 'alter ignore' route, but I'm betting that my connections will time out before it finishes. – ethrbunny Apr 19 '12 at 13:50

1 Answer


If an SQL query to find the duplicates can complete without timing out, I think you should be able to do a SELECT with a COUNT() aggregate and a HAVING clause that restricts the output to only the groups with duplicate data (COUNT(DUPEDATA) > 1). The results of this SELECT can be placed INTO a temporary table, which can then be joined with the primary table for the DELETE query.

This approach uses the set-operations strengths of SQL/MySQL -- no need for Perl coding.
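A rough sketch of that approach, reusing the placeholder names from the question (my_table, col_a, col_b, col_c); the temp-table name is hypothetical, and the final re-insert, which keeps one copy of each duplicated group, is an extra step not spelled out above but needed because the rows are identical in every column:

```sql
-- One row per group that occurs more than once.
CREATE TEMPORARY TABLE dupes AS
SELECT col_a, col_b, col_c, COUNT(*) AS cnt
FROM my_table
GROUP BY col_a, col_b, col_c
HAVING COUNT(*) > 1;

-- Remove every copy of the duplicated rows from the primary table...
DELETE my_table
FROM my_table
JOIN dupes
  ON my_table.col_a = dupes.col_a
 AND my_table.col_b = dupes.col_b
 AND my_table.col_c = dupes.col_c;

-- ...then put a single copy of each group back.
INSERT INTO my_table (col_a, col_b, col_c)
SELECT col_a, col_b, col_c FROM dupes;

DROP TEMPORARY TABLE dupes;
```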

Mark Leighton Fisher
  • I guess that's the problem - any select * will time out. And I need to know all the columns to find the duplicates. – ethrbunny Apr 19 '12 at 16:04