Removing duplicates with unique index

Question

I inserted between two tables fields A,B,C,D, believing I had created a Unique Index on A,B,C,D to prevent duplicates. However I somehow simply made a normal index on those. So duplicates got inserted. It is 20 million record table.

If I change my existing index from normal to unique or simply a add a new unique index for A,B,C,D will the duplicates be removed or will adding fail since unique records exist? I'd test it yet it is 30 mil records and I neither wish to mess the table up or duplicate it.

You need to use the `IGNORE` keyword - otherwise it will fail. Test it on a small test table. — Paul Spiegel, Apr 15 '16 at 12:23
"IGNORE is a MySQL extension to standard SQL. It controls how ALTER TABLE works if there are duplicates on unique keys in the new table or if warnings occur when strict mode is enabled ..." - [ALTER TABLE Syntax](http://dev.mysql.com/doc/refman/5.7/en/alter-table.html) — Paul Spiegel, Apr 15 '16 at 12:51
@PaulSpiegel That makes sense. I tried duplicating Table1, structure only, and adding a unique index, then doing an insert between the original and newly indexed table but it did fail. I did that because changing or adding the unique index on 30 mil records would have taken too long. So now I added "Ignore" after "Insert" and it worked great. — user3649739, Apr 15 '16 at 14:18

score 94 · Accepted Answer · edited Oct 23 '18 at 01:42

If you have duplicates in your table and you use

ALTER TABLE mytable ADD UNIQUE INDEX myindex (A, B, C, D);

the query will fail with Error 1062 (duplicate key).

But if you use IGNORE

-- (only works before MySQL 5.7.4)
ALTER IGNORE TABLE mytable ADD UNIQUE INDEX myindex (A, B, C, D);

the duplicates will be removed. But the documentation doesn't specify which row will be kept:

IGNORE is a MySQL extension to standard SQL. It controls how ALTER TABLE works if there are duplicates on unique keys in the new table or if warnings occur when strict mode is enabled. If IGNORE is not specified, the copy is aborted and rolled back if duplicate-key errors occur. If IGNORE is specified, only one row is used of rows with duplicates on a unique key. The other conflicting rows are deleted. Incorrect values are truncated to the closest matching acceptable value.

As of MySQL 5.7.4, the IGNORE clause for ALTER TABLE is removed and its use produces an error.

(ALTER TABLE Syntax)

If your version is 5.7.4 or greater - you can:

Copy the data into a temporary table (it doesn't technically need to be temporary).
Truncate the original table.
Create the UNIQUE INDEX.
And copy the data back with INSERT IGNORE (which is still available).

CREATE TABLE tmp_data SELECT * FROM mytable;
TRUNCATE TABLE mytable;
ALTER TABLE mytable ADD UNIQUE INDEX myindex (A, B, C, D);
INSERT IGNORE INTO mytable SELECT * from tmp_data;
DROP TABLE tmp_data;

If you use the IGNORE modifier, errors that occur while executing the INSERT statement are ignored. For example, without IGNORE, a row that duplicates an existing UNIQUE index or PRIMARY KEY value in the table causes a duplicate-key error and the statement is aborted. With IGNORE, the row is discarded and no error occurs. Ignored errors generate warnings instead.

(INSERT Syntax)

Also see: INSERT ... SELECT Syntax and Comparison of the IGNORE Keyword and Strict SQL Mode

This is a revelation. I was pulling my hair out trying to get rid of dups before I could modify a unique index to use different columns from previously. Helped me out loads - thanks. — Mitya, Feb 01 '17 at 15:00
Well it appears they "fixed" this functionality so no matter whether you use `IGNORE` or not it still throws a `Can't write; duplicate key in table` — RVP, May 04 '17 at 10:21
@RVP With 5.7.4 and greater you can still copy the data with `INSERT IGNORE ... SELECT ...`. See the update. — Paul Spiegel, May 04 '17 at 16:35

score 4 · Answer 2 · answered Apr 15 '16 at 12:42

if you think there will be duplicates, adding the unique index will fail. first check what duplicates there are:

select * from
(select a,b,c,d,count(*) as n from table_name group by a,b,c,d) x
where x.n > 1

This may be a expensive query on 20M rows, but will get you all duplicate keys that will prevent you from adding the primary index. You could split this up into smaller chunks if you do a where in the subquery: where a='some_value'

For the records retrieved, you will have to change something to make the rows unique. If that is done (query returns 0 rows) you should be safe to add the primary index.

score 3 · Answer 3 · answered May 24 '17 at 15:27

3

Instead of IGNORE you can use ON DUPLICATE KEY UPDATE, which will give you control over which values should prevail.

answered May 24 '17 at 15:27

Oriol Vilaseca

317
3
7

score 0 · Answer 4 · answered Apr 15 '16 at 12:50

To answer your question- adding a UNIQUE constraint on a column that has duplicate values will throw an error.

For example, you can try the following script:

CREATE TABLE `USER` (
  `USER_ID` INT NOT NULL,
  `USERNAME` VARCHAR(45) NOT NULL,
  `NAME` VARCHAR(45) NULL,
  PRIMARY KEY (`USER_ID`));

INSERT INTO USER VALUES(1,'apple', 'woz'),(2,'apple', 'jobs'),
(3,'google', 'sergey'),(4,'google', 'larry');

ALTER TABLE `USER` 
ADD UNIQUE INDEX `USERNAME_UNIQUE` (`USERNAME` ASC);
/*
Operation failed: There was an error while applying the SQL script to the database.
ERROR 1062: Duplicate entry 'apple' for key 'USERNAME_UNIQUE'
*/

Removing duplicates with unique index

4 Answers4

Linked

Related