I'm trying to remove duplicate rows from a table with millions of rows. The field I want to check for duplicates on is too long (it's storing URLs) to put a UNIQUE
index on. Is there any way to remove duplicates quickly?
The recommended method for removing duplicates:
DELETE t1 FROM table1 AS t1 JOIN table1 AS t2 ON t1.id>t2.id AND t1.name=t2.name;
Never seems to finish its job, though I suppose it might just require a lot of time to do.
One idea I've heard here is to create an MD5
hash column for indexing and comparison. Is this the recommended route? If so, should I be truncating this column for space/speed considerations?