
Suppose I have a table with the following columns and data:

id | email     | name
1  | aa@aa.com | aa
2  | aa@aa.com | aaa
3  | bb@b.com  | bb

I want to delete the duplicate entries; what's the best way to do it? (When more than one entry is present, the first record is kept.) Also, my table is 40GB+ with billions of records.

The options we had were:

1)

ALTER IGNORE TABLE table_name ADD UNIQUE INDEX (email);

It was not even thinkable, as we could not afford ~4 hours of downtime.

2)

  • Identify the ids you have to delete with GROUP BY and push them to a temp table.
  • Inner join the base table with the temp table and delete those rows.

That would create holes and cause performance issues.
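For illustration, a minimal sketch of option 2 (the table name users is an assumption based on the sample data; "first record" is taken to mean the lowest id per email):

-- Collect the ids to delete: every row except the first (lowest) id per email.
CREATE TEMPORARY TABLE dup_ids AS
SELECT t.id
FROM users t
JOIN (SELECT email, MIN(id) AS keep_id FROM users GROUP BY email) m
  ON t.email = m.email AND t.id <> m.keep_id;

-- Inner join the base table with the temp table and delete those rows.
DELETE t
FROM users t
JOIN dup_ids d ON t.id = d.id;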

3) The solution we went with was:

  • Took a dump of the table with the insert-ignore option (done on a slave to reduce the load).
  • Restored the data into a different DB.
  • Swapped the two tables with RENAME TABLE.
  • Incremental data changes were also copied (we took the downtime here). This could be done with a few minutes of downtime. It worked for me because there were no UPDATEs on the table (we had only INSERTs).
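For reference, a rough sketch of that flow (the database names db and db_dedup and the table name users are assumptions; the copy in db_dedup must already carry the UNIQUE index on email so that INSERT IGNORE silently drops the duplicates on restore):

# Dump the data only, as INSERT IGNORE statements, from the slave.
mysqldump --insert-ignore --no-create-info db users > users_data.sql

# Restore into the copy that has UNIQUE(email); duplicate emails are skipped.
mysql db_dedup < users_data.sql

# Atomically swap the deduplicated table in place of the original.
mysql -e "RENAME TABLE db.users TO db.users_old, db_dedup.users TO db.users"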

What is the best solution if UPDATEs must also be supported? My main constraint is the downtime.

Maybe I can extend option 3 by logging the UPDATE queries that touch the table and replaying them after restoring.

Is there a better way of doing it?

georgecj11
  • Your table is 40GB? I'm guessing it's not really about email addresses!?! – Strawberry Mar 15 '13 at 16:46
  • Why should holes matter? If your code depends on ids being sequential with no holes, then your code is badly constructed. Option 2 boils down to this answer: http://stackoverflow.com/questions/4685173/delete-all-duplicate-rows-except-for-one-in-mysql – Marc B Mar 15 '13 at 16:46
  • @Strawberry: Obviously, I'm saving a lot more than that. – georgecj11 Mar 15 '13 at 16:52
  • @MarcB: My actual table has 2-3 indexes, including composite indexes. Most of my queries are not on id. And whatever index you use, as I understand it, tables with holes will show performance issues. ANALYZE TABLE can help, though. – georgecj11 Mar 15 '13 at 16:55

1 Answer


Use mysqldump and export the whole table into a "|"-separated text file.

It looks like this:

table.dat

1|aa@aa.com|aa
2|aa@aa.com|aaa
3|bb@b.com|bb

Let's say table.dat has 1 billion records.
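The export itself could be done, for example, with mysqldump --tab (the database db, table users and dump path are assumptions; --tab writes a pipe-separated users.txt plus a users.sql schema file into a directory the MySQL server can write to):

mysqldump --tab=/tmp/dump --fields-terminated-by='|' db users
mv /tmp/dump/users.txt table.dat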

  1. Split the table.dat file into 1000 sub-files.
  2. Use awk, sed, shell, Perl or Ruby (I love Ruby) and remove the duplicate lines from all 1000 files in PARALLEL. See "background processes in Unix".
  3. Merge the 1000 files back into one dat file.
  4. Remove the duplicates again. (This part is a little inefficient; think about it and optimize.)
  5. Make sure your final dat file doesn't have any duplicate lines.

Load the final.dat into the table!
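As a sketch of steps 1-5 plus the load, using the column layout from the sample above (file names, database db and table users are assumptions; the awk filter keeps the first line seen for each email, i.e. field 2, and the second pass over the merged file is what actually guarantees uniqueness, since duplicates can span the split files):

# 1. Split into 1000 pieces of 1 million lines each (-a 4 gives enough suffix characters).
split -l 1000000 -a 4 table.dat part_

# 2. Dedup each piece in parallel as background processes.
for f in part_*; do
  awk -F'|' '!seen[$2]++' "$f" > "$f.dedup" &
done
wait

# 3 + 4. Merge, then dedup once more across the whole file.
cat part_*.dedup | awk -F'|' '!seen[$2]++' > final.dat

# Load final.dat back in (into an emptied or fresh copy of the table;
# the path must be readable by the MySQL server).
mysql db -e "LOAD DATA INFILE '/path/to/final.dat' INTO TABLE users FIELDS TERMINATED BY '|'"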

This can be a little quicker! Anyway, don't stop looking for the best way.

beck03076
  • This will work only if no updates hit the table during the whole activity. It's roughly the same as what we did, except that we used MySQL itself directly to solve our problem. – georgecj11 Mar 16 '13 at 04:07