
I have an import table with user data, and I need to mark the rows with duplicate field values because they should not be imported.

CREATE TABLE `import` (
  ID int(10) unsigned NOT NULL AUTO_INCREMENT,
  method varchar(20) DEFAULT NULL,
  f1 text,
  f2 text,
  PRIMARY KEY (ID)
) ENGINE=MyISAM  DEFAULT CHARSET=utf8;

The field f1 can hold duplicate values. The query to select them works:

SELECT id, a.f1 FROM import a INNER JOIN
(
    SELECT f1 FROM import
    WHERE f1 IS NOT NULL AND f1 != ''
    GROUP BY f1
    HAVING COUNT(id) > 1
) b
ON a.f1 = b.f1

The problem is the outer query to do the update. Here's the whole shebang:

UPDATE import SET method = 'ERR_DUPLICATE' WHERE import.id IN
(
    SELECT id FROM
    (
        SELECT id, a.f1 FROM import a INNER JOIN
        (
            SELECT f1 FROM import
            WHERE f1 IS NOT NULL AND f1 != ''
            GROUP BY f1
            HAVING COUNT(id) > 1
        ) b
        ON a.f1 = b.f1
    ) c
)

That construction comes from MySQL: You can't specify target table 'tasks' for update in FROM clause, which is the error I got before wrapping the subquery in an extra derived table. The query above works, but it takes 0.5 seconds on a 20,000-row table with about 30 duplicates. I'll have to deal with much bigger import tables, so this is a show stopper.

Any ideas how to speed this up?

Timm
  • What indexes do you have defined? – Jonathan Hall Jun 23 '11 at 22:43
  • Good question @Flimzy, there's only the primary key. The other fields are really placeholders and the field with duplicates can be any one of them. Should I try indexing all of them? There's 40 fields in all. – Timm Jun 23 '11 at 22:46
  • I changed f1 to varchar(2048) with an index. Now it takes 2 seconds! – Timm Jun 23 '11 at 22:50
  • I would suggest indexing any fields you join on (so a.f1 in your example). I'd also suggest looking at the output of EXPLAIN for clues to where to optimize. – Jonathan Hall Jun 23 '11 at 22:50
  • How many of these rows are you going to be dealing with in the long run? Millions, I take it? – Jonathan Hall Jun 23 '11 at 22:52
  • EXPLAIN is not for UPDATE queries according to the documentation. There may be a couple of hundred thousand rows per import. The module I'm working on already runs a lot of other queries... – Timm Jun 23 '11 at 22:57
  • The UPDATE portion of your query is going to be pretty constant, however. Of course the addition of more indexes will slow that part down. The part you really have control over is the SELECT portion. So running EXPLAIN on that part will be helpful. – Jonathan Hall Jun 23 '11 at 23:02
  • EXPLAIN says:

        id  select_type  table   type  possible_keys  key   key_len  ref   rows   Extra
        1   PRIMARY              ALL   NULL           NULL  NULL     NULL  30
        2   DERIVED              ALL   NULL           NULL  NULL     NULL  12
        2   DERIVED      a       ALL   NULL           NULL  NULL     NULL  21185  Using where; Using join buffer
        3   DERIVED      import  ALL   NULL           NULL  NULL     NULL  21185  Using where; Using temporary; Using filesort

    Hope you can decipher it. – Timm Jun 23 '11 at 23:07
  • table a: Using where; Using join buffer; table import: Using where; Using temporary; Using filesort – Timm Jun 23 '11 at 23:10
  • How about splitting it into 2 queries: write the IDs of the duplicates into a temporary table and do "UPDATE import SET method = 'ERR_DUPLICATE' WHERE import.id IN (SELECT id FROM temp)"? – Timm Jun 23 '11 at 23:19

1 Answer


Try this modified version:

CREATE TEMPORARY TABLE duplicate_ids
    SELECT MAX(id) AS id FROM import
    WHERE f1 IS NOT NULL AND f1 != ''
    GROUP BY f1
    HAVING COUNT(*) > 1
    ORDER BY NULL;
UPDATE import SET method = 'ERR_DUPLICATE' WHERE import.id IN(
      SELECT id FROM duplicate_ids
);

This will mark the row with the greater ID in each group as the duplicate. ORDER BY NULL suppresses the implicit sorting that GROUP BY would otherwise perform. Furthermore, because filtering and grouping on TEXT columns is inefficient, you can maintain an extra column that holds a hash code of the text in f1.

ALTER TABLE import ADD COLUMN f1_hash INT UNSIGNED NOT NULL;
ALTER TABLE import ADD KEY(f1_hash);

f1_hash is filled with the value returned by CRC32(f1) (http://dev.mysql.com/doc/refman/5.0/en/mathematical-functions.html#function_crc32). CRC32 can produce collisions, so you must ultimately compare the f1 column itself.
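For example, backfilling the new column could look like this (a minimal sketch, not part of the original answer; rows with a NULL f1 keep the column's default of 0, and CRC32('') is also 0, so both are excluded by the f1_hash > 0 filter used below):

UPDATE import
SET f1_hash = CRC32(f1)   -- CRC32 returns an unsigned 32-bit integer
WHERE f1 IS NOT NULL;     -- rows with NULL f1 keep the default 0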

CREATE TEMPORARY TABLE duplicate_ids
    SELECT i2.id FROM import i1 JOIN import i2
    ON i2.id <> i1.id AND i1.f1_hash = i2.f1_hash AND i1.f1_hash > 0
    WHERE i1.f1 = i2.f1;

Then perform the UPDATE as before. You certainly do NOT need an index on the f1 column itself, so it is better to remove it, because it only adds unnecessary overhead.
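Putting the pieces together, the final step might look like this (a sketch; the DROP is optional, since temporary tables disappear when the connection closes):

-- Mark every row identified as a duplicate via the hash-based temporary table.
UPDATE import SET method = 'ERR_DUPLICATE' WHERE import.id IN (
      SELECT id FROM duplicate_ids
);

DROP TEMPORARY TABLE duplicate_ids;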

georgepsarakis