0

I have a mysql table which looks something like this:

id_one     id_two
1          2
2          1
3          2
2          3
4          5
5          4

I want to delete rows with two duplicate values inrespective of which columns they are in so the example would look like this:

id_one     id_two
1          2
3          2
5          4

There are over 12 million rows in total. Any ideas on how I should do this? Php or mysql query would be preferred.

Brian Tompsett - 汤莱恩
  • 5,753
  • 72
  • 57
  • 129

3 Answers3

1
DELETE a
FROM table1 a
LEFT JOIN
    (
      select id_one, id_two
      from   Table1
      GROUP BY least(id_one, id_two), greatest(id_one, id_two)
    ) b ON a.id_one = b.id_one AND a.id_two = b.id_two
WHERE b.id_two IS NULL
John Woo
  • 258,903
  • 69
  • 498
  • 492
  • Note: This would also delete the rows with id_one=id_two. – wildplasser Jan 06 '13 at 12:36
  • @wildplasser how come it will delete the records when `id_one=id_two`. can you modify this one? – John Woo Jan 06 '13 at 12:49
  • because for `one=two`, least(one,two) equals greatest(one,two) – wildplasser Jan 06 '13 at 12:51
  • @wildplasser then why will it delete? my fiddle shows it does not. http://sqlfiddle.com/#!2/b264f/1 – John Woo Jan 06 '13 at 12:52
  • Because you data does not happen to have records with id_one=id_two. – wildplasser Jan 06 '13 at 12:55
  • @wildplasser no it has. `2,2` and `6,6`. [see this fiddle link](http://sqlfiddle.com/#!2/b264f/1) – John Woo Jan 06 '13 at 12:57
  • Ah, I just saw your additions. The thing appears to wotk, but it looks like a very mysql-specific hack. – wildplasser Jan 06 '13 at 15:22
  • I tested this last night. It took way too long to process this query. It didn't delete any rows from the table and it was going for way too long (over 6 hours) so I decided to quit it. Is there any faster way to do this? Exactly half of these rows should be deleted (6 million from 12 million rows) since we accidently inserted (1,2) and (2,1) when we should have just inserted (1,2) or (2,1) (just an example). – user1911058 Jan 07 '13 at 12:01
0

I would advise a 2-step approach:

  1. Make id_one always the smaller value, i.e., if id_one is larger than id_two then swap their values - consider something like this (taken from here):

    UPDATE tablename
    SET id_one = (@temp:=id_one), id_one = id_two, id_two = @temp
    WHERE id_one > id_two
    
  2. Remove the duplicates as described here:

    DELETE tablename FROM tablename INNER JOIN
        (SELECT min(primary_key) AS min_id, id_one, id_two FROM tablename
         GROUP BY id_one, id_two
         HAVING count(1) > 1) AS d
    ON (d.id_one = tablename.id_one
        AND d.id_two = tablename.id_two
        AND d.min_id <> tablename.primary_key)
    

    (I assume that you will have a primary key on a table that holds 12 million entries.)

Not tested, so please backup your data!

Community
  • 1
  • 1
Niko
  • 26,516
  • 9
  • 93
  • 110
  • I don't have a primary key. – user1911058 Jan 06 '13 at 11:24
  • Then you should probably add one. `ALTER TABLE tablename ADD id INT PRIMARY KEY AUTO_INCREMENT` - you can remove it after the procedure if you're sure about not needing one. – Niko Jan 06 '13 at 11:26
0
DELETE FROM ztable zt
WHERE zt.id_one > zt.id_two
  AND EXISTS (
    SELECT * 
    FROM ztable tx
    WHERE tx.id_one = zt.id_two
      AND tx.id_two = zt.id_one
    )
    ;

won't work in mysql, because in mysql you cannot reference the table being updated or deleted. Since you want to make a backup copy anyway, you could use that instead in the EXISTS subquery:

CREATE table safetable AS (SELECT * from ztable);

DELETE FROM ztable zt 
WHERE zt.id_one > zt.id_two 
AND EXISTS (
   SELECT * 
   FROM safetable tx
   WHERE tx.id_one = zt.id_two 
     AND tx.id_two = zt.id_one
   );
wildplasser
  • 43,142
  • 8
  • 66
  • 109