Get IDs of MySQL Duplicate Rows to Delete

Question

I'm trying to write a Laravel eloquent statement to do the following.

Query a table and get the IDs of all the duplicate rows (or ideally all the IDs except the ID of the first instance of each duplicate).

Right now I have the following MySQL statement:

select `codes`, count(`codes`) as `occurrences`, `customer_id` from `pizzas` 
group by `codes`, `customer_id` 
having `occurrences` > 1;

The duplicates are any row that shares a combination of codes and customer_id, example:

codes,customer_id
183665A4,3
183665A4,3
183665A4,3
183665A4,3
183665A4,3

I'm trying to delete all but 1 of those.

This is returning a set of the codes, with their occurrences and their customer_id, as I only want rows that have both.

Currently I loop through this, save the ID of the first instance, and then query again to delete any rows without that ID. This is not very fast: there are about 50 million rows, so each query takes forever, and we run multiple queries for each duplicate set to delete.

// get every order that shares the same code and customer ID
$orders = Order::select('id', 'codes', DB::raw('count(`codes`) as `occurrences`'), 'customer_id')
    ->groupBy('codes')
    ->groupBy('customer_id')
    ->having('occurrences', '>', 1)
    ->limit(100)
    ->get();

// loop through those orders
foreach ($orders as $order)
{
    // find the first order that matches this duplicate set
    $first_order = Order::where('codes', $order->codes)
        ->where('customer_id', $order->customer_id)
        ->first();

    // delete all but the first
    Order::where('codes', $order->codes)
        ->where('customer_id', $order->customer_id)
        ->where('id', '!=', $first_order->id)
        ->delete();
}

There has got to be a more efficient way to track down all rows that share the same code and customer_id, and delete all the duplicates but keep the first instance, right? lol

I'm thinking maybe if I can add a fake column to the results that is an array of every ID, I could at least then remove the first ID and delete the others.

Surely if you were to join the same table on matching columns but where unique id is not the same you'd find matches without any iteration — Popnoodles, Jul 05 '22 at 18:55
I'm not entirely sure what you mean or how to go about doing that @Popnoodles sorry =( - and the thing is, I don't want to delete the first instance of the duplicates, just the duplications — Octoxan, Jul 05 '22 at 19:02
What exactly is a duplicate? I'm having trouble visualizing your data. What does `codes` contain? Is the `unique` logic a combination of `codes` and `customer_id`? Perhaps you could [edit your question](https://stackoverflow.com/posts/72874206/edit) and include some example rows from your database to help us visualize it. — Tim Lewis, Jul 05 '22 at 19:29
@TimLewis yep that is what should be unique, so no customer_id/codes pair should exist more than once. Just edited the question as well. I have about 30 million records and about 2 million need deleted lol — Octoxan, Jul 05 '22 at 19:35
"*30 million records and about 2 million need deleted*" - Oof... Kinda makes you wish you'd have a `unique` index `codes|customer_id` on that Table eh? My approach with that many records would be the same as yours TBH; in batches, find the duplicate pairs, find the `id` for the first one, and run a single `DELETE FROM pizzas WHERE id IN ?`, where `?` is the list of IDs minus the first one. That statement would then be run X times until all duplicates are nuked. Then _please_ add a unique index. — Tim Lewis, Jul 05 '22 at 19:37
@TimLewis Yeahhh there were already a hundred relationships / other tables referencing just the ID index of this table, and all sorts of other fun going on before the client even added that there could be multiple customers. Is it possible to add a composite index this late after? And still have the old ID index for all the relationships? Or is it one or the other — Octoxan, Jul 05 '22 at 19:57
Haha totally understandable; hindsight is 20/20. It is definitely possible to add a unique constraint to `pizzas` on the `codes` and `customer_id` columns (not sure if composite key is the right term here, I _think_ that's different), but you have to get rid of the duplicates first. If you try to add it, and there are existing duplicates, I'm pretty sure it reverts. It shouldn't have any effect on existing FK references or anything though. — Tim Lewis, Jul 05 '22 at 20:00
Ohhh right right, index, not a key. I swear when this was created in Laravel 4 ages ago there was no way to set unique for a combination of multiple columns but it looks like there is now. Great to know for if/when I ever have this finish purging the couple million rows. Thank you! — Octoxan, Jul 05 '22 at 20:22
@RickJames literally any one of them, the rest of their data populates from an API whenever it's missing. — Octoxan, Jul 06 '22 at 02:06
@AD7six and because the table is 140 GB currently and creating a copy of it isn't feasible at this time =| lol — Octoxan, Jul 06 '22 at 02:06
What percentage of the file will be deleted? How much spare disk space do you have? — Rick James, Jul 06 '22 at 02:45

AD7six · Answer 1 · 2022-07-06T08:00:22.670

Don't involve PHP

This seems not very fast

The logic in the question is inherently slow because it issues a lot of queries, and each query incurs:

DB<->PHP network roundtrip
PHP ORM logic/overhead

Given the numbers in the question, the whole code needs calling up to 10k times (if there are exactly 2 occurrences for every one of those 2 million duplicate records). For argument's sake, let's say it is called 1k times, with each call handling a batch of 100 duplicate sets. Overall that's:

1,000 queries finding duplicates
100,000 queries finding the first record
100,000 delete queries

201,000 queries is a lot, and the PHP overhead makes it an order of magnitude slower (a guess, based on experience).

Do it directly on the DB

Just eliminating the PHP/ORM/network time (even if it's on the same machine) would make the process markedly faster; that would involve writing a stored procedure to mimic the PHP logic in the question.

But there's a simpler way; the specifics depend on the circumstances. In the question and comments you've said:

The table is 140GB in size
It contains 50 million rows
Approx 2 million are duplicate records
There isn't enough free space to make a copy of the table

Taking these comments at face value the process I suggest is:

Ensure you have a functional DB backup

Before doing anything make sure you have a functional DB backup. If you manage to make a mistake and e.g. drop the table - be sure you can recover without loss of data.

You'll be testing this process on a copy of the database first anyway, right :) ?

Create a table of "ids to keep" and populate it

This is a variation on removing duplicates with a unique index:

CREATE TABLE ids_to_keep (
    id INT PRIMARY KEY,
    codes VARCHAR(50) NOT NULL, # use same schema as source table
    customer_id INT NOT NULL, # use same schema as source table
    UNIQUE KEY derp (codes,customer_id)
);

INSERT IGNORE INTO ids_to_keep 
    SELECT id, codes, customer_id from pizzas;

MySQL will silently drop the rows that conflict with the unique index, resulting in a table with one id per codes+customer_id tuple.
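Before deleting anything, it may be worth checking that the row counts look plausible. The following is only a sketch using the table names from this answer; the column aliases are made up for illustration. With the numbers in the question, `rows_to_delete` should come out at roughly 2 million.

```sql
-- Compare the source table with the de-duplicated id list.
-- COUNT(*) over tens of millions of rows can itself take a while.
SELECT
  (SELECT COUNT(*) FROM pizzas)      AS total_rows,
  (SELECT COUNT(*) FROM ids_to_keep) AS rows_to_keep,
  (SELECT COUNT(*) FROM pizzas)
    - (SELECT COUNT(*) FROM ids_to_keep) AS rows_to_delete;
```

If `rows_to_delete` is wildly different from what you expect, stop and investigate before running any DELETE.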

If you don't have space for this table, make room :). It shouldn't be too large: 140GB and 50M rows means each row is approx 3KB, so this temporary table will likely require a single-digit percentage of the original size.
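To see how much space the two tables actually take rather than guessing, something like the following should work. It reads `information_schema.tables`, and assumes both tables live in the currently selected database; for InnoDB the reported sizes are estimates, not exact figures.

```sql
-- Approximate on-disk size (data + indexes) of both tables, in GB.
SELECT
  table_name,
  ROUND((data_length + index_length) / 1024 / 1024 / 1024, 2) AS size_gb
FROM information_schema.tables
WHERE table_schema = DATABASE()
  AND table_name IN ('pizzas', 'ids_to_keep');
```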

Delete the duplicate records

Before executing any expected-to-be-slow query, use EXPLAIN to check how it will be executed and whether it is likely to complete in a reasonable amount of time.
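As a sketch of what that check could look like for the delete in this answer: as far as I know, MySQL 5.6 and later accept EXPLAIN in front of a DELETE, and it only shows the plan without removing any rows.

```sql
-- Shows the execution plan only; no rows are deleted.
EXPLAIN
DELETE FROM pizzas
WHERE id NOT IN (SELECT id FROM ids_to_keep);
```

In the output, check that the lookup against `ids_to_keep` uses its primary key rather than a full scan per row.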

To run as a single query:

DELETE FROM 
  pizzas 
WHERE 
  id NOT IN (SELECT id from ids_to_keep);

If you wish to do things in chunks:

DELETE FROM 
  pizzas 
WHERE 
  id BETWEEN 0 AND 10000 AND
  id NOT IN (SELECT id from ids_to_keep);
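Running the chunks by hand gets tedious, so the loop can live in the database too. This is a minimal sketch, not a tested implementation: the procedure name `delete_duplicate_pizzas` and the parameter `chunk_size` are made up for illustration, `id` is assumed to be a positive integer that fits in a BIGINT, and `DELIMITER` is a command of the `mysql` command-line client, so other clients may need a different way to define the procedure.

```sql
DELIMITER //

-- Hypothetical helper: walks the id range in fixed-size chunks and
-- deletes every row whose id is not listed in ids_to_keep.
CREATE PROCEDURE delete_duplicate_pizzas(IN chunk_size INT)
BEGIN
  DECLARE start_id BIGINT DEFAULT 0;
  DECLARE max_id BIGINT;

  SELECT MAX(id) INTO max_id FROM pizzas;

  -- If the table is empty, max_id is NULL and the loop body never runs.
  WHILE start_id <= max_id DO
    DELETE FROM pizzas
    WHERE id BETWEEN start_id AND start_id + chunk_size - 1
      AND id NOT IN (SELECT id FROM ids_to_keep);

    SET start_id = start_id + chunk_size;
  END WHILE;
END //

DELIMITER ;

CALL delete_duplicate_pizzas(10000);
```

With autocommit enabled and no surrounding transaction, each chunk is committed on its own, so an interrupted run can simply be started again; ids that are already gone just match nothing.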

Cleanup

Once the table isn't needed any more, get rid of it:

DROP TABLE ids_to_keep;

Make sure this doesn't happen again

To prevent this happening again, add a unique index to the table:

# MySQL requires an index name; the one used here is arbitrary
CREATE UNIQUE INDEX pizzas_codes_customer_id_unique ON pizzas(codes, customer_id);
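Creating the index will fail if any duplicates are left, so it may be worth confirming the table is clean first. This reuses the grouping query from the question; an empty result means no duplicate codes+customer_id pairs remain.

```sql
-- Any row returned here is a pair that still exists more than once.
SELECT codes, customer_id, COUNT(*) AS occurrences
FROM pizzas
GROUP BY codes, customer_id
HAVING occurrences > 1
LIMIT 10;
```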

score 0 · Answer 2 · answered Jul 06 '22 at 07:41

Try this one; for each set of duplicates it keeps only the row with the latest id, and it leaves non-duplicate rows alone:

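// self-join: delete every row that has a duplicate with a higher id
$deleteDuplicates = DB::table('orders as ord1')
    ->join('orders as ord2', 'ord1.id', '<', 'ord2.id')
    ->whereColumn('ord1.codes', 'ord2.codes')
    ->whereColumn('ord1.customer_id', 'ord2.customer_id')
    ->delete();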

Waqas Altaf

I haven't used mysql in "some years" does that query work (as there's no php logic involved there just the query would be a better answer IMO :) )? I recall mysql complaining if you try to delete with a join to the same table, and needing to use a named query as a workaround. I suspect with the numbers in the question this query will die before it populates the temporary table it's generating. – AD7six Jul 06 '22 at 07:53