70

I've got a big (>1 mil rows) MySQL database messed up by duplicates. I think it could be from 1/4 to 1/2 of the whole db filled with them. I need to get rid of them quickly (I mean query execution time). Here's how it looks:
id (index) | text1 | text2 | text3
The text1 & text2 combination should be unique; if there are any duplicates, only one row should remain, preferably one with text3 NOT NULL. Example:

1 | abc | def | NULL  
2 | abc | def | ghi  
3 | abc | def | jkl  
4 | aaa | bbb | NULL  
5 | aaa | bbb | NULL  

...becomes:

1 | abc | def | ghi   #(doesn't really matter whether id:2 or id:3 survives)
2 | aaa | bbb | NULL  #(if there's no NOT NULL text3, NULL will do)

New ids could be anything; they do not depend on the old table ids.
I've tried things like:

CREATE TABLE tmp SELECT text1, text2, text3
FROM my_tbl
GROUP BY text1, text2;
DROP TABLE my_tbl;
ALTER TABLE tmp RENAME TO my_tbl;

Or SELECT DISTINCT and other variations.
While they work on small databases, query execution time on mine is just huge (I never actually got to the end; > 20 min).

Is there any faster way to do that? Please help me solve this problem.

bizzz
  • Please specify: a) is renumbering the id field necessary? b) what quantity or ratio of duplicates are we expecting? (useful to decide between working in-situ or creating a new table) c) which indexes exist on the current table. – mjv Oct 30 '09 at 20:12
  • a) renumbering the id field is not necessary b) my estimate: from 1/4 to 1/2 of db are duplicates c) id is the only index. I'll edit question accordingly. – bizzz Oct 31 '09 at 07:17

9 Answers

150

I believe this will do it, using on duplicate key + ifnull():

create table tmp like yourtable;

alter table tmp add unique (text1, text2);

insert into tmp select * from yourtable 
    on duplicate key update text3=ifnull(text3, values(text3));

rename table yourtable to deleteme, tmp to yourtable;

drop table deleteme;

This should be much faster than anything that requires GROUP BY, DISTINCT, or a subquery, or even ORDER BY. It doesn't even require a filesort, which would kill performance on a large temporary table. It will still require a full scan over the original table, but there's no avoiding that.
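If you want to see the ifnull() mechanics in isolation, here is a minimal self-contained demo on the question's sample rows (the table and column names follow the answer and the question; the column types are my guess):

create table yourtable (
    id int primary key,
    text1 varchar(16), text2 varchar(16), text3 varchar(16)
);

insert into yourtable values
    (1,'abc','def',NULL), (2,'abc','def','ghi'), (3,'abc','def','jkl'),
    (4,'aaa','bbb',NULL), (5,'aaa','bbb',NULL);

create table tmp like yourtable;
alter table tmp add unique (text1, text2);

-- row 2 collides with row 1 on (abc,def): ifnull(NULL,'ghi') fills in 'ghi';
-- row 3 collides again: ifnull('ghi','jkl') keeps 'ghi';
-- row 5 collides with row 4: ifnull(NULL,NULL) stays NULL
insert into tmp select * from yourtable
    on duplicate key update text3=ifnull(text3, values(text3));

select * from tmp;  -- leaves (1,abc,def,ghi) and (4,aaa,bbb,NULL)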

ʞɔıu
  • Thanks, it works! 1.2 mil rows became 0.6 mil in 60 minutes, so that's around 10000 rows written per minute. Thanks for the clear explanation too! :) – bizzz Oct 31 '09 at 10:55
  • This was a big help. Thank you – rpearce Jun 23 '12 at 01:31
  • @ʞɔıu (upsideDownNick) Simple and effective. For those who do not care about the text3 NOT NULL part, you can use INSERT IGNORE (dropping the ON DUPLICATE KEY UPDATE part) and MySQL will ignore errors and insert only the first distinct value it finds (ignoring subsequent duplicates). – tony gil Jul 28 '12 at 23:51
  • +1 This is a smart solution. In my case, the client loses connection to server about 10 minutes in (table has 45+ million records) and results in messy locks being open, etc. -- any recommendation for how to handle this? – Matt May 29 '13 at 18:46
  • In case anyone is interested, I expanded the @ʞɔıu response with FURTHER USE CASES on https://stackoverflow.com/questions/3311903/remove-duplicate-rows-in-mysql/47392593#47392593 – César Revert-Gomar Nov 20 '17 at 15:18
95

Found this simple 1-line code to do exactly what I needed:

ALTER IGNORE TABLE dupTest ADD UNIQUE INDEX(a,b);

Taken from: http://mediakey.dk/~cc/mysql-remove-duplicate-entries/
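Note that ALTER IGNORE was removed in MySQL 5.7.4 (see the last comment below), so on newer servers you need a copy-table equivalent. A rough sketch, where dupTest_tmp and dupTest_old are names of my choosing:

create table dupTest_tmp like dupTest;
alter table dupTest_tmp add unique index (a,b);
insert IGNORE into dupTest_tmp select * from dupTest;  -- duplicate (a,b) rows are silently skipped
rename table dupTest to dupTest_old, dupTest_tmp to dupTest;
drop table dupTest_old;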

liorq
  • Looks like a MySQL bug is preventing your query (specifically the `IGNORE` part) from working: Error Code: 1062 Duplicate entry 'abc-def' for key 'text1' – bizzz Oct 19 '11 at 13:44
  • @bizzz You just need to run `set session old_alter_table=1` if you get that error, then try again. – Matthew Oct 07 '13 at 21:28
  • This isn't working on BLOB/TEXT columns; it gives the error "BLOB/TEXT column 'name' used in key specification without a key length". – Dashrath Nov 06 '16 at 10:36
  • As of MySQL 5.7.4, the IGNORE clause for ALTER TABLE is removed and its use produces an error. [Source](https://stackoverflow.com/a/5456599/4409950). – Tyler Jul 16 '23 at 23:30
12

DELETE FROM dups
WHERE id NOT IN(
    SELECT id FROM (
        SELECT DISTINCT id, text1, text2
            FROM dups
        GROUP BY text1, text2
        ORDER BY text3 DESC
    ) as tmp
)

This queries all records, groups by the distinguishing fields, and orders by text3 DESC (the intent being to pick a record with a non-NULL text3 first). Then we select the ids from that result (these are the good ids; they won't be deleted) and delete all ids that AREN'T those.

Any query like this affecting the entire table will be slow. You just need to run it once and absorb the cost, then prevent duplicates in the future.

After you have done this "fix" I would apply a UNIQUE INDEX (text1, text2) to that table to prevent the possibility of duplicates in the future.
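For example (the index name is my own choice):

ALTER TABLE dups ADD UNIQUE INDEX uq_text1_text2 (text1, text2);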

If you want to go the "create a new table and replace the old one" route, you could use the inner SELECT statement above to create your INSERT statement.

MySQL specific (assumes new table is named my_tbl2 and has exactly the same structure):

INSERT INTO my_tbl2
    SELECT DISTINCT id, text1, text2, text3
            FROM dups
        GROUP BY text1, text2
        ORDER BY text3 DESC

See MySQL INSERT ... SELECT for more information.

Kevin Peno
  • Sorry, both your suggestions delete duplicates but don't choose the right text3 value to survive (NULLs remain while there are NOT NULL alternatives) – bizzz Oct 31 '09 at 11:06
9

Remove duplicates without breaking foreign key references:

create table tmp like mytable;
ALTER TABLE tmp ADD UNIQUE INDEX(text1, text2, text3, text4, text5, text6);
insert IGNORE into tmp select * from mytable;
-- surviving rows keep their original ids, so foreign keys remain valid
delete from mytable where id not in (select id from tmp);
Gadelkareem
3

If you can create a new table, do so with a unique key on the text1 + text2 fields. Then insert into it, ignoring duplicate-key errors, using the INSERT IGNORE syntax (`new_tbl` here stands for whatever you name the copy):

INSERT IGNORE INTO new_tbl
    SELECT * FROM my_tbl ORDER BY text3 DESC;
  • I think the order by text3 desc will put the NULLs last, but double check that.

Indexes on all those columns could help a lot, but creating them now could be pretty slow.

Scott Saunders
  • It will put nulls last, but it doesn't satisfy the request which was "keep the first one that does not have a null in text3". to do this you will need to order by ID ASC and add a WHERE text3 IS NOT NULL to your statement. – Kevin Peno Oct 30 '09 at 20:28
  • That's a good point. However that requirement contradicts his sample output: 2 | aaa | bbb | NULL Perhaps he'll tell us what he really wants. – Scott Saunders Oct 30 '09 at 20:40
  • I reread his request. It appears that he doesn't care which row survives, so long as non-nulls are kept when they exist. So your example would suit well. :) – Kevin Peno Oct 30 '09 at 21:09
  • Thanks, working. With 1.2 mil rows it took almost 3 hours (est. 4000 rows written per minute). It keeps the duplicate with the biggest text3 field, which corresponds to my database logic. – bizzz Oct 31 '09 at 10:50
1

For large tables with few duplicates, you may want to avoid copying the whole table to another place. One way is to create a temporary table holding the rows you want to keep (for each key with duplicates), and then delete duplicates from the original table.

An example is given here.
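A minimal sketch of that idea against the question's schema (for simplicity it keeps the lowest id per duplicated pair; preferring a non-NULL text3 would take an extra step):

-- one surviving id per (text1, text2) pair that actually has duplicates
CREATE TEMPORARY TABLE keep
SELECT MIN(id) AS keep_id, text1, text2
FROM my_tbl
GROUP BY text1, text2
HAVING COUNT(*) > 1;

-- remove every other row in those duplicated pairs
DELETE m
FROM my_tbl m
JOIN keep k ON m.text1 = k.text1 AND m.text2 = k.text2
WHERE m.id <> k.keep_id;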

user1931858
0

I don't have much experience with MySQL. If it has analytic functions (MySQL 8.0 and later does), try:

delete from my_tbl
 where id in (
     select id
       from (select id,
                    row_number() over (partition by text1, text2
                                       order by text3 desc) as rn
               from my_tbl
               /* optional: where text1 like 'a%' */
            ) as t2
      where rn > 1
     )

The optional where clause means you'll have to run it multiple times, once for each letter, etc. Create an index on text1?

Before running this, confirm that "text3 desc" will sort nulls last in MySQL.
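A quick self-contained way to check:

SELECT x FROM (SELECT NULL AS x UNION ALL SELECT 'a') t ORDER BY x DESC;  -- 'a' comes first, so NULLs do sort last in MySQL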

redcayuga
  • Sorry, Error Code : 1064 near '(partition by...' – bizzz Oct 31 '09 at 11:10
  • I guess MySql doesn't have analytic functions. I'll try again later. – redcayuga Nov 02 '09 at 21:43
  • can you run: create table dups as SELECT text1, text2, max(case when text3 is null then 1 else 0) as has_null3, max(case when text3 is not null then 1 else 0) as has_not_null3, min(case when text3 is not null then id else null) as pref_id FROM my_tbl GROUP BY text1, text2 having count(*) > 1? This will give us the list of duplicated text1/2 pairs and some of the "preferred" ids. If it takes too long, and it probably will, add "where text1 like 'a%'" or something like that. – redcayuga Nov 02 '09 at 22:14
0

I know this is an old thread, but I have a somewhat messy method that is much faster and customizable; in terms of speed I'd say 10 sec instead of 100 sec (10:1).

My method does require all that messy stuff you were trying to avoid:

  • GROUP BY (and HAVING)
  • GROUP_CONCAT with ORDER BY
  • 2 temporary tables
  • using files on disk!
  • somehow (PHP?) deleting the file after

But when you are talking about millions (or in my case tens of millions) of rows, it's worth it.

Anyway, it's not much, but here is my sample:

EDIT: if I get comments I'll explain further how it works :)

START TRANSACTION;

DROP temporary table if exists to_delete;

CREATE temporary table to_delete as (
    SELECT
        -- pick all the duplicated IDs except the ones that stay in the DB
        -- the order in which IDs are chosen is given by "ORDER BY campos_ordenacao DESC"; the first one is the one that stays
        right(
            group_concat(id ORDER BY campos_ordenacao DESC SEPARATOR ','),
            length(group_concat(id ORDER BY campos_ordenacao DESC SEPARATOR ',')) 
                - locate(",",group_concat(id ORDER BY campos_ordenacao DESC SEPARATOR ','))
        ) as ids,

        count(*) as c

    -- table to remove duplicates from
    FROM teste_dup

    -- columns used to identify duplicates
    group by test_campo1, test_campo2, teste_campoN
    having count(*) > 1 -- it's a duplicate
);

-- raise this system variable's limit to the maximum
SET SESSION group_concat_max_len=4294967295;

-- dump all the ids to delete into a file
select group_concat(ids SEPARATOR ',') from to_delete INTO OUTFILE 'sql.dat';

DROP temporary table if exists del3;
create temporary table del3 as (select CAST(1 as signed) as ix LIMIT 0);

-- load the ids to delete from the file into a temporary table
load data infile 'sql.dat' INTO TABLE del3
LINES TERMINATED BY ',';

alter table del3 add index(ix);

-- delete the selected ids
DELETE teste_dup -- table
from teste_dup -- table

join del3 on id=ix;

COMMIT;
JDuarteDJ
0

You can remove all the duplicate entries with this simple self-join query. It deletes every row that has a duplicate with a higher id, so the highest id in each (text1, text2) group survives (note: it does not prefer a non-NULL text3):

DELETE i1
FROM my_tbl i1
JOIN my_tbl i2
  ON i1.text1 = i2.text1
 AND i1.text2 = i2.text2
 AND i1.id < i2.id;