
I'm trying to remove duplicate rows from a table with millions of rows. The field I want to check for duplicates on is too long (it's storing URLs) to put a UNIQUE index on. Is there any way to remove duplicates quickly?

The recommended method for removing duplicates:

DELETE t1 FROM table1 AS t1 JOIN table1 AS t2 ON t1.id>t2.id AND t1.name=t2.name;

Never seems to finish, though I suppose it might just need a lot of time.

One idea I've heard here is to create an MD5 hash column for indexing and comparison. Is this the recommended route? If so, should I be truncating this column for space/speed considerations?

Ben G
  • I would have thought that creating a partial (prefix) index on name would make that query run a lot quicker. – Frederick Cheung May 24 '12 at 22:44
  • Some recommendations at http://stackoverflow.com/questions/7088447/slow-self-join-delete-query – benizi May 24 '12 at 22:44
  • @Frederick but these are URLs, so a partial comparison might give the wrong result? e.g. google.com vs google.com/page – Ben G May 24 '12 at 22:45
  • The db won't be able to use the index purely on its own, but it should be able to use the index to greatly reduce the number of rows where it actually has to do the full comparison – Frederick Cheung May 24 '12 at 22:53
  • @FrederickCheung: I'm curious: how do you create a "prefix index" in MySQL? –  May 24 '12 at 22:57
  • Pass a length to create index : http://dev.mysql.com/doc/refman/5.0/en/create-index.html – Frederick Cheung May 24 '12 at 22:59
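For reference, the prefix index Frederick describes is created by passing a length with the column name, roughly as in the sketch below (the prefix length of 100 is an illustrative choice, not a figure from the thread):

    -- Index only the first 100 characters of the long URL column;
    -- MySQL can then use the index to narrow down candidate rows
    -- before comparing the full values.
    CREATE INDEX idx_name_prefix ON table1 (name(100));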

2 Answers


A hash would give you a column you could put an index on, so the expensive t1.Name = t2.Name comparison becomes a far cheaper t1.Hash = t2.Hash. Adding the hash to 1,000,000 records would take a while, though.
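
A sketch of what that could look like, assuming a Hash column has already been added, populated, and indexed (the column name here is illustrative):

    -- Compare the short indexed hash first; the full name comparison
    -- only has to run on rows whose hashes already match.
    DELETE t1
    FROM table1 AS t1
         JOIN table1 AS t2
              ON t1.Hash = t2.Hash
                 AND t1.name = t2.name
                 AND t1.id > t2.id;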

Another option, if this is a one-off, would be to page the clean-out, e.g. something like

WHERE t1.id >= 0 AND t1.id < 10000
WHERE t1.id >= 10000 AND t1.id < 20000

etc.
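
A sketch of that paged approach, running the duplicate check from the question one id range at a time (the range size of 10,000 is just the example figure above):

    -- Each statement only considers rows whose id falls in the current
    -- range as deletion candidates, so the work is done in bounded chunks.
    DELETE t1 FROM table1 AS t1 JOIN table1 AS t2
        ON t1.id > t2.id AND t1.name = t2.name
    WHERE t1.id >= 0 AND t1.id < 10000;

    DELETE t1 FROM table1 AS t1 JOIN table1 AS t2
        ON t1.id > t2.id AND t1.name = t2.name
    WHERE t1.id >= 10000 AND t1.id < 20000;

    -- ...and so on, until the highest id has been covered.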

Tony Hopkinson

I may not have thought this through, but it's worth a try:

  • Create a column, md5url, and fill it with the MD5 of the url (UPDATE table1 SET md5url = MD5(url))
  • Make a (non-unique) index on that column md5url - an MD5 hash is short enough to index (both steps are sketched after the statement below)
  • Change your statement to:

    DELETE t1 
    FROM table1 AS t1 
         JOIN table1 AS t2 
              ON t1.md5url = t2.md5url 
                 AND t1.name=t2.name 
                 AND t1.id>t2.id;
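
For completeness, steps 1 and 2 might look like the sketch below (the column type and index name are my assumptions; the answer itself only names the column md5url):

    -- MD5() returns a 32-character hex string, so CHAR(32) is enough.
    ALTER TABLE table1 ADD COLUMN md5url CHAR(32);
    UPDATE table1 SET md5url = MD5(url);
    CREATE INDEX idx_md5url ON table1 (md5url);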
    

That way, the JOIN condition works primarily on an index. Only when the indexed column, md5url, matches do we actually compare the URLs - because, depending on how many URLs you have, it becomes possible at some point that two different URLs share the same MD5. The third condition ensures that only one of the two identical rows is deleted (the one with the higher id).
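
If you want to see whether any such collisions actually exist in your data, a query along these lines would show them (this check is my addition, not part of the answer above):

    -- List md5url values that are shared by more than one distinct URL.
    SELECT md5url, COUNT(DISTINCT name) AS distinct_urls
    FROM table1
    GROUP BY md5url
    HAVING COUNT(DISTINCT name) > 1;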

I would love to hear whether this works - it makes perfect sense in my head right now ;-)

Lukx
  • It did work, but I can't say whether it went any faster than it would have w/o md5.. since it took like a whole day. Also, I wonder whether your approach actually saved time-- question is whether MySQL evaluated in a "lazy" way. See http://stackoverflow.com/questions/10747047/is-mysql-logic-evaluation-lazy-short-circuiting-in-join-clause and lemme know what you think – Ben G May 28 '12 at 22:13