2

We have a situation where duplicate entries have crept into our table with more than 60 million entries (duplicate here implies that all fields, except the AUTO_INCREMENT index field have the same value). We suspect that there are about 2 million duplicate entries in the table. We would like to delete these duplicate entries such that the earliest instances of the duplicate entries are retained.

Let me explain with an illustrative table:

CREATE TABLE people
(
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
name VARCHAR(40) NOT NULL DEFAULT '',
age INT NOT NULL DEFAULT 0,
phrase VARCHAR(40) NOT NULL DEFAULT '',
PRIMARY KEY (id)
);

INSERT INTO people(name, age, phrase) VALUES ('John Doe', 25, 'qwert'), ('William Smith', 19, 'yuiop'),
('Peter Jones', 19, 'yuiop'), ('Ronnie Arbuckle', 32, 'asdfg'), ('Ronnie Arbuckle', 32, 'asdfg'),
('Mary Evans', 18, 'hjklp'), ('Mary Evans', 18, 'hjklpd'), ('John Doe', 25, 'qwert');

SELECT * FROM people;
+----+-----------------+-----+--------+
| id | name            | age | phrase |
+----+-----------------+-----+--------+
|  1 | John Doe        |  25 | qwert  |
|  2 | William Smith   |  19 | yuiop  |
|  3 | Peter Jones     |  19 | yuiop  |
|  4 | Ronnie Arbuckle |  32 | asdfg  |
|  5 | Ronnie Arbuckle |  32 | asdfg  |
|  6 | Mary Evans      |  18 | hjklp  |
|  7 | Mary Evans      |  18 | hjklpd |
|  8 | John Doe        |  25 | qwert  |
+----+-----------------+-----+--------+

We would like to remove duplicate entries so that we get the following output:

SELECT * FROM people;
+----+-----------------+-----+--------+
| id | name            | age | phrase |
+----+-----------------+-----+--------+
|  1 | John Doe        |  25 | qwert  |
|  2 | William Smith   |  19 | yuiop  |
|  3 | Peter Jones     |  19 | yuiop  |
|  4 | Ronnie Arbuckle |  32 | asdfg  |
|  6 | Mary Evans      |  18 | hjklp  |
|  7 | Mary Evans      |  18 | hjklpd |
+----+-----------------+-----+--------+

On smaller sized tables the following approach would work:

CREATE TABLE people_uniq LIKE people;

INSERT INTO people_uniq SELECT * FROM people GROUP BY name, age, phrase;

DROP TABLE people;

RENAME TABLE people_uniq TO people;

SELECT * FROM people;
+----+-----------------+-----+--------+
| id | name            | age | phrase |
+----+-----------------+-----+--------+
|  1 | John Doe        |  25 | qwert  |
|  2 | William Smith   |  19 | yuiop  |
|  3 | Peter Jones     |  19 | yuiop  |
|  4 | Ronnie Arbuckle |  32 | asdfg  |
|  6 | Mary Evans      |  18 | hjklp  |
|  7 | Mary Evans      |  18 | hjklpd |
+----+-----------------+-----+--------+

Kindly suggest a solution that would scale to a table with tens of millions of entries and many more columns. We are using MySQL version 5.6.49.

Sandeep
  • 1,245
  • 1
  • 13
  • 33
  • If you first create an index on `name, age, phrase`, wouldn't that speed up the `SELECT * FROM people GROUP BY name, age, phrase`? Also, you wrote _"We would like to delete these duplicate entries such that the earliest instances of the duplicate entries are retained"_ but the example for smaller sized tables will not necessarily retain the earliest instances of dupes. Is that really a necessary constraint? – kmoser Dec 15 '20 at 04:01

1 Answers1

0

why not deleting duplicates?

DELETE FROM people
where id in (
SELECT MAX(id) 
FROM people 
GROUP BY name, age, phrase
HAVING count(*) > 1
)

if it still takes too much time , you can do it in batch

eshirvana
  • 23,227
  • 3
  • 22
  • 38