Delete duplicates from mysql

Question

I have very large database with 1,112,000,000 rows and I have noticed I have lots of duplicated info:

id domain status

1 Domain.com active

2 domain.com active

3 DOMAIN.com active

What is the best way to remove duplicate rows and leave only unique domains ?

New table; `SELECT DISTINCT ...`; delete old table; rename new table. — Sirko, Jul 19 '14 at 11:38
duplicate of http://stackoverflow.com/questions/2630440/how-to-delete-duplicates-on-mysql-table http://stackoverflow.com/questions/3311903/remove-duplicate-rows-in-mysql — SerhatCan, Jul 19 '14 at 11:40
Do you have other tables that reference this id column? If yes, you've got to update those tables to that id, that will remain. — VMai, Jul 19 '14 at 12:15
Yes, I have other tables that is reference to domain id. I think I need lower all domains first, then remove dublicates. — Justas Pukys, Jul 19 '14 at 12:20

J A · Answer 1 · 2014-07-19T12:56:24.153

0

DELETE FROM 
    domains
WHERE 
    id NOT IN (SELECT domainTable.id FROM (SELECT * from domains) as domainTable group by domainTable.domain)

The concept here is to take leverage of 'GROUP BY'. When the subquery is executed, it'll return only one id per domain (including duplicates). The parent query is simply excluding the 'id' list and therefore, finds the records whose ids are ignored by GROUP BY clause.

(Answer updated and tested)

edited Jul 19 '14 at 12:56

answered Jul 19 '14 at 11:41

J A

1,776
1
12
13

Got: #1093 - You can't specify target table 'domains' for update in FROM clause – Justas Pukys Jul 19 '14 at 11:50
Can you explain what this code is doing and link to relevant documentation? – Stephen Ostermiller Jul 19 '14 at 12:23
@StephenOstermiller: Did you mean to ask here ;) – J A Jul 19 '14 at 12:27
Yes. A code only answer is not high quality. Please add additional information to your answer. – Stephen Ostermiller Jul 19 '14 at 12:34
@StephenOstermiller The statement seems pretty much self explanatory. Anyway, will add some explanation shortly. – J A Jul 19 '14 at 12:38

Gordon Linoff · Answer 2 · 2014-07-19T11:59:34.420

This is a large table. In such cases, it usually works better to create a new temporary table, truncate the original, and re-insert the data:

create temporary table temp as 
    select max(id), lower(domain), status
    from mytable
    group by lower(domain), status;

truncate table mytable;

insert into mytable(id, domain, status)
    select id, domain, status
    from temp;

This assumes that you can take the big table offline for a while for the re-insertion. Otherwise, you'll need to delete in batches.

If you want to do it as a delete, I would still start with the same temporary table and add an index on id. Then do:

delete m
    from mytable m left join
         temp
         on m.id = temp.id
    where temp.id is null;

You can add a limit clause, such as limit 10000 to run this in batches.

NOTE:

The assumption is that you don't care which of the rows to keep. If you just want the biggest id, then use:

create temporary table temp as 
    select id, lower(domain), status
    from mytable m
    where not exists (select 1
                      from mytable m2
                      where lower(m2.domain) = lower(m.domain) and m2.id > m.id
                     );

To be honest, though, this will have poor performance. I would check your local settings to see if comparisons are case sensitive or not. If they are not, just add an index on mytable(domain, id) for the above query. If they are, then put everything in the same case:

update mytable
    set domain = lower(domain);

And create the index on mytable(domain, id).

I have just copied database to dev server and going to try this. — Justas Pukys, Jul 19 '14 at 12:24

score 0 · Answer 3 · edited Jul 17 '16 at 06:05

Hi I found a simple method:

If you want to remove duplicate data and remain original data of TableA:

1) create a similar table example: TableB with the same reference

2) Then run SQL query in phpmyadmin:

First query:

INSERT INTO TableB SELECT * FROM TableA  group by column having count(*) >1

*This to insert the original data of the duplicate data

Second query:

INSERT INTO TableB SELECT * FROM TableA  group by column having count(*) = 1

*This to insert the data that does not belong to duplicate data

So in TableB, you're able to store data without duplicate value.

Delete duplicates from mysql

3 Answers3