2

I have very large database with 1,112,000,000 rows and I have noticed I have lots of duplicated info:

id domain status

1 Domain.com active

2 domain.com active

3 DOMAIN.com active

What is the best way to remove duplicate rows and leave only unique domains ?

Justas Pukys
  • 124
  • 1
  • 1
  • 8
  • New table; `SELECT DISTINCT ...`; delete old table; rename new table. – Sirko Jul 19 '14 at 11:38
  • duplicate of http://stackoverflow.com/questions/2630440/how-to-delete-duplicates-on-mysql-table http://stackoverflow.com/questions/3311903/remove-duplicate-rows-in-mysql – SerhatCan Jul 19 '14 at 11:40
  • Show us some ways and we'll tell you which is best – Strawberry Jul 19 '14 at 11:58
  • Do you have other tables that reference this id column? If yes, you've got to update those tables to that id, that will remain. – VMai Jul 19 '14 at 12:15
  • Yes, I have other tables that is reference to domain id. I think I need lower all domains first, then remove dublicates. – Justas Pukys Jul 19 '14 at 12:20

3 Answers3

0
DELETE FROM 
    domains
WHERE 
    id NOT IN (SELECT domainTable.id FROM (SELECT * from domains) as domainTable group by domainTable.domain)

The concept here is to take leverage of 'GROUP BY'. When the subquery is executed, it'll return only one id per domain (including duplicates). The parent query is simply excluding the 'id' list and therefore, finds the records whose ids are ignored by GROUP BY clause.

(Answer updated and tested)

J A
  • 1,776
  • 1
  • 12
  • 13
0

This is a large table. In such cases, it usually works better to create a new temporary table, truncate the original, and re-insert the data:

create temporary table temp as 
    select max(id), lower(domain), status
    from mytable
    group by lower(domain), status;

truncate table mytable;

insert into mytable(id, domain, status)
    select id, domain, status
    from temp;

This assumes that you can take the big table offline for a while for the re-insertion. Otherwise, you'll need to delete in batches.

If you want to do it as a delete, I would still start with the same temporary table and add an index on id. Then do:

delete m
    from mytable m left join
         temp
         on m.id = temp.id
    where temp.id is null;

You can add a limit clause, such as limit 10000 to run this in batches.

NOTE:

The assumption is that you don't care which of the rows to keep. If you just want the biggest id, then use:

create temporary table temp as 
    select id, lower(domain), status
    from mytable m
    where not exists (select 1
                      from mytable m2
                      where lower(m2.domain) = lower(m.domain) and m2.id > m.id
                     );

To be honest, though, this will have poor performance. I would check your local settings to see if comparisons are case sensitive or not. If they are not, just add an index on mytable(domain, id) for the above query. If they are, then put everything in the same case:

update mytable
    set domain = lower(domain);

And create the index on mytable(domain, id).

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
0

Hi I found a simple method:

If you want to remove duplicate data and remain original data of TableA:

1) create a similar table example: TableB with the same reference

2) Then run SQL query in phpmyadmin:

First query:

INSERT INTO TableB SELECT * FROM TableA  group by column having count(*) >1 

*This to insert the original data of the duplicate data

Second query:

INSERT INTO TableB SELECT * FROM TableA  group by column having count(*) = 1

*This to insert the data that does not belong to duplicate data

So in TableB, you're able to store data without duplicate value.

Racil Hilan
  • 24,690
  • 13
  • 50
  • 55