How to remove duplicate items in MySQL with a dataset of 20 million rows?

Question

I've got big MySQL database. I need to delete the duplicate item quickly. Here's how it looks:

id | text1 | text2|    
1  | 23    |  43  |   
2  | 23    |  44  |  
3  | 23    |  44  |

After the deleting, the remain part of table should be:

id | text1 | text2|   
1  | 23    |  43  |   
3  | 23    |  44  |

I don't care about the id. the most important is no duplicate items will be disappear.

"Not caring" is sometimes indicative of larger problems! You *should* care! — Strawberry, Apr 02 '14 at 12:45
Id doesn't mean any thing in my project. It can be any number. The only thing I want is to remove all of duplicate items — Ding Ding, Apr 02 '14 at 12:46
Possible duplicate http://stackoverflow.com/questions/4685173/delete-all-duplicate-rows-except-for-one-in-mysql — Abhik Chakraborty, Apr 02 '14 at 12:48
I think Strawberry means that you should care about this at the beginning of your project to avoid all duplicated items! — JGeo, Apr 02 '14 at 12:48
Couldn't you just prevent duplicates so you'd not need to remove them? Define unique index across (text1, text2). — vhu, Apr 02 '14 at 12:48

score 5 · Accepted Answer · answered Apr 02 '14 at 12:48

5

You may try this:

ALTER IGNORE TABLE my_tablename ADD UNIQUE INDEX idx_name (text1 , text2);

ie, try to add UNIQUE INDEX to your columns and alter the table

This has an advantage that in future also there will be no duplicate rows which you can insert in your table

answered Apr 02 '14 at 12:48

Rahul Tripathi

I've never tried this - out of interest, does the `IGNORE` clause mean that dup rows are silently ignored? +1 – halfer Apr 02 '14 at 12:51
@halfer:- Yes for sure! – Rahul Tripathi Apr 02 '14 at 12:52

score 1 · Answer 2 · answered Apr 02 '14 at 12:48

1

DELETE FROM t WHERE id NOT IN
(SELECT MIN(id) FROM t GROUP BY text1, text2)

answered Apr 02 '14 at 12:48

Olexa

score 0 · Answer 3 · answered Apr 02 '14 at 12:48

0

Run this:

SELECT COUNT(*), text1, text2
GROUP BY text1, text2
HAVING COUNT(*) > 1;

When you find rows here, delete one row for each match, and then run it again.

I'm not sure what it will be like in terms of performance - perhaps it doesn't matter, if you do this offline?

answered Apr 02 '14 at 12:48

halfer

2

group by is not efficient. it will takes a long time. – Ding Ding Apr 02 '14 at 12:49
Are you sure that matters, in this case? In many cases of "it needs to be fast", it doesn't need to be. Can you do the query offline? – halfer Apr 02 '14 at 12:50

3 Answers3