Suppose I have a table with these columns:

id | email     | name
1  | aa@aa.com | aa
2  | aa@aa.com | aaa
3  | bb@b.com  | bb

What is the best way to delete duplicate entries, keeping the first record when more than one exists for the same email? The table is 40GB+ with billions of rows.
The options we had were :
1)
ALTER IGNORE TABLE table_name ADD UNIQUE INDEX (email);
That was not even thinkable, as we could not afford roughly 4 hours of downtime.
2)
- Identify the ids to delete with a GROUP BY and push them into a temp table.
- Inner join the base table with the temp table and delete the matching rows.
But deleting that many rows in place leaves holes in the table and causes performance issues.
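A minimal sketch of option 2, assuming the duplicate key is email, the lowest id per email is kept, and `table_name` stands in for the real table:

```sql
-- Collect the ids to delete: every row whose id is not the
-- smallest id for its email.
CREATE TEMPORARY TABLE dup_ids AS
SELECT t.id
FROM table_name t
JOIN (
    SELECT email, MIN(id) AS keep_id
    FROM table_name
    GROUP BY email
    HAVING COUNT(*) > 1
) d ON t.email = d.email AND t.id <> d.keep_id;

-- Delete the duplicates by joining against the temp table.
DELETE t
FROM table_name t
JOIN dup_ids x ON t.id = x.id;
```

At this scale the DELETE should be run in batches (e.g. chunks of ids) rather than as one statement, or it will hold locks and bloat the undo log.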
3) The solution we went with:
- Took a dump of the table with the insert-ignore option (done on a slave to reduce load on the master).
- Restored the data into a different database.
- Swapped the two tables with RENAME TABLE.
- Copied over the incremental changes (this is where we took the downtime). The whole operation needed only a few minutes of downtime. It worked for us because the table was insert-only (no UPDATEs).
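The steps above can be sketched roughly as follows (database and table names are placeholders; the copy of the table in the scratch database is assumed to already have a UNIQUE index on email, so INSERT IGNORE silently drops the duplicates):

```shell
# Dump only the data, as INSERT IGNORE statements.
# Run this on a slave to keep load off the master.
mysqldump --insert-ignore --no-create-info mydb table_name > dump.sql

# Restore into the scratch database; rows with a duplicate
# email are skipped thanks to the UNIQUE index + INSERT IGNORE.
mysql scratch_db < dump.sql

# Swap the tables. A single RENAME TABLE statement is atomic,
# and it works across databases on the same server.
mysql -e "RENAME TABLE mydb.table_name TO mydb.table_name_old,
                       scratch_db.table_name TO mydb.table_name;"
```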
What is the best solution if UPDATEs must also be supported? My main constraint is downtime.
Maybe I can extend option 3 by enabling logging of UPDATE queries that touch the table and replaying them after the restore.
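One way that extension could look, using the binary log instead of a custom query log (a sketch only; the binlog file name and position are placeholders, and note that mysqlbinlog filters by database, not by table):

```shell
# Have mysqldump record the binlog coordinates at dump time
# as a comment inside the dump file.
mysqldump --insert-ignore --no-create-info --master-data=2 \
    mydb table_name > dump.sql

# ...restore into the scratch database as before...

# Replay everything logged for this database since the dump
# onto the deduplicated copy, then do the RENAME swap.
mysqlbinlog --database=mydb --start-position=107 \
    mysql-bin.000042 | mysql scratch_db
```

This replays all statements for the database, so it only works cleanly if the table being deduplicated is the only one changing, or if the replay is further filtered.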
Is there a better way of doing it?