I need to find duplicate entries on 2 columns out of 5 on a table containing 1 billion rows.
In detail:
Duplicate entries on 2 columns means: column a can have repeated entries and column b can have repeated entries, but both columns considered together cannot have repeated entries.
Reason for this:
I need to find out what duplicate record were erroneously inserted during backfill of data as the backfill was done without the table having a primary key.
The table has following columns:
id, warehouse, quantity, date, updated by
The duplicate entries need to be found on id + warehouse.
I tried using
select id, warehouse
from my_table
group by id, warehouse
having count(*) > 1
This does not give me the duplicates on combination of id and warehouse.
I am afraid of doing a self join on the table as the operation would take too long on such a big table.
Please help me figure out the fastest way of getting duplicates.
Also, as an added challenge, i need to delete the duplicate entries (keeping only one record of the duplicate entries in the table). Is there a fast way of doing that on this huge a table.
When I try to set primary key, the query is stuck on "copy to tmp table" step for about 48 hours with a metadata lock on the table preventing any inserts.
Details on database:
Engine InnoDB
Server mysql
RAM 7.5GB