
I have a large database table (~1 million records) from which I need to purge duplicate records. The table structure is as follows:

| id | relation_id | foo   | bar  | timestamp           |
|----|-------------|-------|------|---------------------|
| 1  | 1           | 14.20 | 0.22 | 2019-10-21 14:00:01 |
| 2  | 1           | 14.20 | 0.22 | 2019-10-21 14:00:01 |
| 3  | 1           | 14.20 | 0.22 | 2019-10-21 14:00:01 |
| 4  | 2           | 10.36 | 0.75 | 2019-10-21 14:00:01 |
| 5  | 2           | 10.36 | 0.75 | 2019-10-21 14:00:01 |
| 6  | 2           | 10.36 | 0.75 | 2019-10-21 14:00:01 |

As per the example above, there are a lot of records that have the exact same combination of values for relation_id, foo, bar and timestamp. I need to create a script that identifies the unique rows and then deletes the duplicate ones, so I would end up with something like:

| id | relation_id | foo   | bar  | timestamp           |
|----|-------------|-------|------|---------------------|
| 1  | 1           | 14.20 | 0.22 | 2019-10-21 14:00:01 |
| 4  | 2           | 10.36 | 0.75 | 2019-10-21 14:00:01 |

I have tested looping through the relation_id values (there are only 20 unique ones) and then running something like this to build a collection of the unique records:

$unique = collect([]);
Model::where('relation_id', $relation_id)->chunk(100, function ($items) use ($unique) {
    $unique->push($items->unique()->values()->all());
});

From that, I had planned to loop through all of the Model records and delete each one that was not within the $unique collection. Something like this:

Model::chunk(100, function ($items) use ($unique) {
    foreach ($items as $item) {
        if (!$unique->contains('id', $item->id)) {
            $item->delete();
        }
    }
});

My problem is as the database table is so large, I cannot test if this logic works. Running the first part of the above script (to populate $unique) for a single $relation_id ran in tinker for 30 minutes without yielding results.

I'm relatively confident this isn't the best approach to deleting duplicate records, as it requires multiple queries that I assume could be optimised (which is critical when dealing with such a large table).

So what is the most efficient way to query a database table to check for unique records (based on multiple columns) and delete the duplicate records?

Sheixt
  • This answer will probably prove useful to you, deleting rows based on multiple unique columns. As an added benefit it will prevent this from happening in the future: [https://stackoverflow.com/a/3312066/798634](https://stackoverflow.com/a/3312066/798634) – Matt K Oct 21 '19 at 19:10
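
For reference, the unique-index suggestion in the comment above could look like the following in a Laravel migration. This is only a sketch: the table name is assumed, and the index can only be added after the existing duplicates have been purged, otherwise the ALTER will fail.

use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

// Inside a migration's up() method: a composite unique index so the
// database itself rejects any future duplicate of this combination.
Schema::table('models', function (Blueprint $table) {
    $table->unique(['relation_id', 'foo', 'bar', 'timestamp'], 'models_dedupe_unique');
});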

2 Answers

You can allow the database to do the heavy lifting here: query it using GROUP BY, then remove everything that doesn't match your query.

$ids = Model::groupBy(['relation_id', 'foo', 'bar', 'timestamp'])
            ->pluck('id')
            ->all();

This translates to the following SQL:

SELECT id FROM models GROUP BY relation_id, foo, bar, timestamp;

So now $ids is an array of IDs, one per unique combination of those columns ([1, 4] in the example above). You can then execute the following to remove all other rows from the DB:

Model::whereNotIn('id', $ids)->delete();
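
As comes up in the comments below, MySQL 5.7's default ONLY_FULL_GROUP_BY sql_mode rejects that query with error 1055, because the bare id is neither grouped nor aggregated. Rather than disabling the mode, a sketch of a variant that aggregates the id instead (MIN(id) keeps the lowest id of each group, which matches the desired result in the question):

// Sketch: MIN(id) is a valid aggregate under ONLY_FULL_GROUP_BY and
// picks the first (lowest) id out of every duplicate group.
$ids = Model::selectRaw('MIN(id) AS id')
            ->groupBy(['relation_id', 'foo', 'bar', 'timestamp'])
            ->pluck('id')
            ->all();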

However, since $ids is probably huge, you are likely to hit the database's limit on placeholders per statement. In that case, you can try using array_chunk() to add multiple whereNotIn clauses to the query:

$query = Model::query();

foreach (array_chunk($ids, 500) as $chunk) {
    $query->whereNotIn('id', $chunk);
}

$query->delete();
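
Note that the chunked whereNotIn clauses above are still compiled into a single DELETE statement, so every id still contributes a placeholder and MySQL's "1390: too many placeholders" error (raised in the comments below) can persist. A possible workaround, sketched here rather than taken from the answer: hold the surviving ids in a lookup set and delete the rest in fixed-size batches, so no statement ever binds more than one batch of ids.

// Sketch: delete duplicates in batches of 1,000. Assumes $ids holds
// the keeper ids produced by the GROUP BY query above.
$keep = array_flip($ids); // O(1) membership checks

Model::select('id')->chunkById(1000, function ($rows) use ($keep) {
    $doomed = [];
    foreach ($rows as $row) {
        if (!isset($keep[$row->id])) {
            $doomed[] = $row->id;
        }
    }
    if ($doomed) {
        Model::whereIn('id', $doomed)->delete();
    }
});

chunkById is safe to use while deleting because it paginates on the id column rather than with offsets.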

I created an SQL Fiddle where you can test this out.

Vince
  • Thanks for the suggestion, I'm not sure that I can use that approach as I am not looking to get just distinct ids (or any unique single value). I am looking for a combination of values; where the `timestamp`, `foo` and `bar` are the same. But could I join these values into a unique string to create my identifier? – Sheixt Oct 21 '19 at 18:18
  • based on that suggestion, how would I go about creating a temporary join to create the unique identifier? – Sheixt Oct 21 '19 at 18:22
  • 1
    @Sheixt, You're right. See my updated answer, which leverages "GROUP BY" rather than "DISTINCT". This should work for you. I'd recommend testing on a smaller dummy dataset prior to running this on your actual DB. – Vince Oct 21 '19 at 18:52
  • thank you so much for the detailed answer! I'm sure this is the right way to go... I've created a slimmed-down version of the database to test upon and booted up tinker... when I run the first line I get the following error `Syntax error or access violation: 1055 Expression #1 of SELECT list is not in GROUP BY clause and contains nonaggregated column` – Sheixt Oct 21 '19 at 19:06
  • That usually means you have a typo in one of the column names inside the `groupBy([...])` statement. Make sure all of your column names are written ***exactly*** the way they are in the database, including case sensitivity. – Vince Oct 21 '19 at 19:09
  • Yeah, I've triple-checked the column names and they are correct in spelling and case sensitivity. What else could affect this? I'm running MySQL 5.7.24... Here is the full error `Syntax error or access violation: 1055 Expression #1 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'database.table.id' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by (SQL: select `id` from `table` group by `realation_id`, `foo`, `bar`, `timestamp`)'` – Sheixt Oct 21 '19 at 19:20
  • Aha! I forgot about that... Give [this a try](https://stackoverflow.com/a/54618070/3109473). – Vince Oct 21 '19 at 19:23
  • Particularly, you'd want to remove `'ONLY_FULL_GROUP_BY'` from the sql modes array. – Vince Oct 21 '19 at 19:26
  • Sorry I don't follow... If I do a direct copy of the modes into the mysql config it runs without the error... when you say partially what do you mean? – Sheixt Oct 21 '19 at 19:28
  • But I didn't say "partially"... I was just clarifying that the sql mode that causes the error you got is `ONLY_FULL_GROUP_BY`. Removing that from the modes array should solve the issue. Doesn't it? – Vince Oct 21 '19 at 19:30
  • 1
    Haha that would be why I'm confused...! I've been staring at code for too long! Sorry. I'm going to proceed with the test and will confirm but it looks good! – Sheixt Oct 21 '19 at 19:32
  • The logic is definitely sound. Unfortunately, I'm hitting a limit error `General error: 1390 Prepared statement contains too many placeholders` due to the table size. I'm looking to see if I can use `chunk()` to solve that... Progress though! – Sheixt Oct 21 '19 at 19:57
  • 1
    Added another update using [`array_chunk()`](https://www.php.net/manual/en/function.array-chunk.php). Let me know if this gets you past the 1390 error. – Vince Oct 21 '19 at 20:47
  • Yeah, I tried that myself, it gets past the 1390 error but now it's just emptying the entire table, which is baffling! – Sheixt Oct 21 '19 at 21:20
  • Oh just to clarify, I also had to typecast the `$ids` variable to `array` as it returns an object otherwise. – Sheixt Oct 21 '19 at 21:28
  • Actually, `$ids` is a [collection](https://laravel.com/docs/collections). Rather than using a typecast, which can yield unexpected results, you can call `->all()` on `$ids` to get the collection's underlying array. – Vince Oct 21 '19 at 21:32
  • Ah, I thought a collection was a type of object, my mistake. I've tried `foreach(array_chunk($ids->all(), 500) as $chunk)` but that returns the 1390 error again. – Sheixt Oct 21 '19 at 21:36
  • I've tried changing `->get(['id'])` to a plain array using `->pluck('id')` but that yields the same error. The `array_chunk()` seems to be handling the processing of the ids but the issue seems to lie in the final query that is processed after the `foreach`. Really I need to handle it in batches, but it needs to have the full array list to check against. I'm toying with the idea of adding a column to the database to create a flag for the unique references, then do a simple query delete on all of the non-flagged rows. This isn't ideal obviously... – Sheixt Oct 21 '19 at 21:52
  • Actually, that's not a bad idea. I'm assuming that de-duping the database is something you're doing that you don't expect to have to do again, right? If that's the case, adding a temporary column to mark unique rows is not a bad idea at all. – Vince Oct 21 '19 at 21:54
  • No, it's a tidy up job as a result of a faulty scheduled task. So should only need to run it once on this table and maybe once again on another but in theory, that'll be it. – Sheixt Oct 21 '19 at 21:59
  • I opted to use MySQL in the end. I've accepted your answer but have posted my own for reference. Thanks again for all your help. – Sheixt Oct 23 '19 at 10:59

For anyone experiencing a similar issue, I opted to use raw MySQL to handle the tidy-up of the database, as it was far more efficient than loading the records into memory using Eloquent.

This is the script I used:

DELETE table_name FROM table_name
LEFT OUTER JOIN (
    SELECT MIN(id) AS minID
    FROM table_name
    GROUP BY relation_id, timestamp
) AS keepRowTable ON table_name.id = keepRowTable.minID
WHERE keepRowTable.minID IS NULL
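
Before pointing a statement like this at a production table, a dry run is cheap insurance. This sketch counts the rows the DELETE above would remove, using the same join and grouping:

SELECT COUNT(*) AS rows_to_delete
FROM table_name
LEFT OUTER JOIN (
    SELECT MIN(id) AS minID
    FROM table_name
    GROUP BY relation_id, timestamp
) AS keepRowTable ON table_name.id = keepRowTable.minID
WHERE keepRowTable.minID IS NULL;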

I accepted @Vince's answer because his approach works within Laravel, but we ran into issues when processing such a large dataset. Plus he's a hero for being so responsive in the comments!

Sheixt