Issue: hundreds of identical (schema) tables. Some of these have some duplicated data that needs to be removed. My usual strategy for this is:
walk list of tables - for each do
create temp table with unique key on all fields
insert ignore select * from old table
truncate original table
insert select * back into original table
drop or clean temp table
For smaller tables this works fine. Unfortunately the tables I'm cleaning often have 100s of millions of records so my jobs and client connections are timing out while I'm running this. (Since there are hundreds of these tables I'm using Perl to walk the list and clean each one. This is where the timeout happens).
Some options I'm looking into:
mysqldump - fast but I don't see how to do the subsequent 'insert ignore' step
into outfile / load infile - also fast but I'm running from a remote host and 'into outfile' creates all the files on the mysql server. Hard to clean up.
do the insert/select in blocks of 100K records - this avoid the db timeout but its pretty slow.
I'm sure there is a better way. Suggestions?