3

I have a table with a few thousand rows. The table contains two columns, name and email. I have several duplicate rows, for example:

  • John Smith | john@smith.com
  • John Smith | john@smith.com
  • Erica Smith | erica@smith.com
  • Erica Smith | erica@smith.com

What would be the easiest way to delete all duplicate results. For example, such that the table's content would = SELECT name, DISTINCT(email) FROM table.

Narnian
  • 3,858
  • 1
  • 26
  • 29
David542
  • 104,438
  • 178
  • 489
  • 842
  • one way to do is using primary key. so you have primary key in your table or just name and email only? – AJP Mar 26 '12 at 21:27

5 Answers5

7

You could pretty easily do this by selecting that query into another table, then renaming it to replace the original.

CREATE TABLE `table2` (
  `name` varchar(255), 
  `email` varchar(255), 
  UNIQUE KEY `email` (`email`));
INSERT INTO `table2` SELECT `name`, DISTINCT(`email`) FROM `table`;
RENAME TABLE `table` TO `table1`;
RENAME TABLE `table2` TO `table`;

Note that this CREATE should be adjusted to your actual table format. I added the unique key on the email field as a suggestion on how you would prevent duplicates in the first place.

Alternatively, you could loop over this

DELETE FROM `table` 
WHERE `email` IN (
  SELECT `email` FROM `table` GROUP BY `email` HAVING count(*) > 1
) LIMIT 1

Which would delete one duplicate record per call. The importance of the limit is to not remove both rows for any duplicate

Umbrella
  • 4,733
  • 2
  • 22
  • 31
2

The easiest way would be to copy all distinct values into a new table:

select distinct *
into NewTable
from MyTable
keyser
  • 18,829
  • 16
  • 59
  • 101
1
DELETE FROM table
WHERE id 
NOT IN
(SELECT A.id
FROM 
(
SELECT name,MAX(id) AS id
FROM table
GROUP BY name
) A
)
Teja
  • 13,214
  • 36
  • 93
  • 155
1

Add an auto-increment field to the table. I believe that when you add it, it will be 'backfilled' for you. Since MySql doesn't allow a delete based on a subquery against the same table, the easiest solution is to then dump the whole dataset into a temptable for use in processing. Assuming you called the new field RowId and the temp table tempTable, you can then use code like this:

DELETE FROM NameAndEmail
LEFT JOIN 
(     SELECT name, email, Max(RowId) as MaxRowId 
      FROM temptable 
      GROUP BY name, email
) as MaxId
WHERE NameAndEmail.Email = MaxId.Email
     and NameAndEmail.Name = MaxId.Name
     and NameAndEmail.RowId <> MaxId.RowId
Rose
  • 156
  • 5
1

Add a unique index

The simplest way to clean up a table with duplicate data - is to just add a unique index:

set session old_alter_table=1;
ALTER IGNORE TABLE `table` ADD UNIQUE INDEX (name, email);

Pay particular attention to the first sql statement, without it the IGNORE flag is ignored and the alter table statement will fail with an error.

Community
  • 1
  • 1
AD7six
  • 63,116
  • 12
  • 91
  • 123