Delete values with two duplicate values inrespective of which column they are in

Question

I have a mysql table which looks something like this:

id_one     id_two
1          2
2          1
3          2
2          3
4          5
5          4

I want to delete rows with two duplicate values inrespective of which columns they are in so the example would look like this:

id_one     id_two
1          2
3          2
5          4

There are over 12 million rows in total. Any ideas on how I should do this? Php or mysql query would be preferred.

php or mysql query would be preferred. @JW It doesnt actually matter in which order they are in. — user1911058, Jan 06 '13 at 10:50
so that's why you `3,2` instead of `2,3` since it comes first on the result? — John Woo, Jan 06 '13 at 10:52
It doesnt matter. It can be 3,2 or 2,3. But just only one. There isnt any unique keys. — user1911058, Jan 06 '13 at 10:55
this query should be placed in a cron or script ... or should be executed only one time for clean the table? — Luca Rainone, Jan 06 '13 at 11:29

score 1 · Answer 1 · answered Jan 06 '13 at 11:32

1

DELETE a
FROM table1 a
LEFT JOIN
    (
      select id_one, id_two
      from   Table1
      GROUP BY least(id_one, id_two), greatest(id_one, id_two)
    ) b ON a.id_one = b.id_one AND a.id_two = b.id_two
WHERE b.id_two IS NULL

SQLFiddle Demo

answered Jan 06 '13 at 11:32

John Woo

258,903
69
498
492

Note: This would also delete the rows with id_one=id_two. – wildplasser Jan 06 '13 at 12:36
@wildplasser how come it will delete the records when `id_one=id_two`. can you modify this one? – John Woo Jan 06 '13 at 12:49
because for `one=two`, least(one,two) equals greatest(one,two) – wildplasser Jan 06 '13 at 12:51
@wildplasser then why will it delete? my fiddle shows it does not. http://sqlfiddle.com/#!2/b264f/1 – John Woo Jan 06 '13 at 12:52
Because you data does not happen to have records with id_one=id_two. – wildplasser Jan 06 '13 at 12:55
@wildplasser no it has. `2,2` and `6,6`. [see this fiddle link](http://sqlfiddle.com/#!2/b264f/1) – John Woo Jan 06 '13 at 12:57
Ah, I just saw your additions. The thing appears to wotk, but it looks like a very mysql-specific hack. – wildplasser Jan 06 '13 at 15:22
I tested this last night. It took way too long to process this query. It didn't delete any rows from the table and it was going for way too long (over 6 hours) so I decided to quit it. Is there any faster way to do this? Exactly half of these rows should be deleted (6 million from 12 million rows) since we accidently inserted (1,2) and (2,1) when we should have just inserted (1,2) or (2,1) (just an example). – user1911058 Jan 07 '13 at 12:01

score 0 · Answer 2 · edited May 23 '17 at 12:06

0

I would advise a 2-step approach:

Make id_one always the smaller value, i.e., if id_one is larger than id_two then swap their values - consider something like this (taken from here):
```
UPDATE tablename
SET id_one = (@temp:=id_one), id_one = id_two, id_two = @temp
WHERE id_one > id_two
```

Remove the duplicates as described here:

DELETE tablename FROM tablename INNER JOIN
    (SELECT min(primary_key) AS min_id, id_one, id_two FROM tablename
     GROUP BY id_one, id_two
     HAVING count(1) > 1) AS d
ON (d.id_one = tablename.id_one
    AND d.id_two = tablename.id_two
    AND d.min_id <> tablename.primary_key)

(I assume that you will have a primary key on a table that holds 12 million entries.)

Not tested, so please backup your data!

edited May 23 '17 at 12:06

Community

1
1

answered Jan 06 '13 at 11:19

Niko

26,516
9
93
110

I don't have a primary key. – user1911058 Jan 06 '13 at 11:24
Then you should probably add one. `ALTER TABLE tablename ADD id INT PRIMARY KEY AUTO_INCREMENT` - you can remove it after the procedure if you're sure about not needing one. – Niko Jan 06 '13 at 11:26

wildplasser · Answer 3 · 2013-01-09T12:24:34.837

DELETE FROM ztable zt
WHERE zt.id_one > zt.id_two
  AND EXISTS (
    SELECT * 
    FROM ztable tx
    WHERE tx.id_one = zt.id_two
      AND tx.id_two = zt.id_one
    )
    ;

won't work in mysql, because in mysql you cannot reference the table being updated or deleted. Since you want to make a backup copy anyway, you could use that instead in the EXISTS subquery:

CREATE table safetable AS (SELECT * from ztable);

DELETE FROM ztable zt 
WHERE zt.id_one > zt.id_two 
AND EXISTS (
   SELECT * 
   FROM safetable tx
   WHERE tx.id_one = zt.id_two 
     AND tx.id_two = zt.id_one
   );

Delete values with two duplicate values inrespective of which column they are in

3 Answers3