
I have a large database table (~1 million records) from which I need to purge duplicate records. The table structure is as follows:

| id | relation_id | foo   | bar  | timestamp           |
|----|-------------|-------|------|---------------------|
| 1  | 1           | 14.20 | 0.22 | 2019-10-21 14:00:01 |
| 2  | 1           | 14.20 | 0.22 | 2019-10-21 14:00:01 |
| 3  | 1           | 14.20 | 0.22 | 2019-10-21 14:00:01 |
| 4  | 2           | 10.36 | 0.75 | 2019-10-21 14:00:01 |
| 5  | 2           | 10.36 | 0.75 | 2019-10-21 14:00:01 |
| 6  | 2           | 10.36 | 0.75 | 2019-10-21 14:00:01 |

As per the example above, there are a lot of records that have the exact same combination of values for relation_id, foo, bar and timestamp. I need to create a script that identifies the unique rows and then deletes the duplicate ones, so I would end up with something like:

| id | relation_id | foo   | bar  | timestamp           |
|----|-------------|-------|------|---------------------|
| 1  | 1           | 14.20 | 0.22 | 2019-10-21 14:00:01 |
| 4  | 2           | 10.36 | 0.75 | 2019-10-21 14:00:01 |

I have tested looping through the relation_id values (there are only 20 unique ones) and then running something like this to build a collection of the unique records:

$unique = collect([]);
Model::where('relation_id', $relation_id)->chunk(100, function ($items) use ($unique) {
    $unique->push($items->unique()->values()->all());
});

From that, I had planned to loop through all of the Model records and delete each one that was not within the $unique collection. Something like this:

Model::chunk(100, function ($items) use ($unique) {
    foreach ($items as $item) {
        if (!$unique->contains('id', $item->id)) {
            $item->delete();
        }
    }
});

My problem is as the database table is so large, I cannot test if this logic works. Running the first part of the above script (to populate $unique) for a single $relation_id ran in tinker for 30 minutes without yielding results.

I'm relatively confident this isn't the best approach to deleting duplicate records, as it requires multiple queries that I assume could be optimised (which is critical when dealing with such a large table).

So what is the most efficient way to query a database table to check for unique records (based on multiple columns) and delete the duplicate records?

Sheixt
  • This answer will probably prove useful to you, deleting rows based on multiple unique columns. As an added benefit it will prevent this from happening in the future: [https://stackoverflow.com/a/3312066/798634](https://stackoverflow.com/a/3312066/798634) – Matt K Oct 21 '19 at 19:10
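
For reference, the unique-index suggestion in the comment above could look like the following in a Laravel migration. This is only a sketch: the table name is assumed, and the index can only be added after the existing duplicates have been purged, otherwise the ALTER will fail.

use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

// Inside a migration's up() method: a composite unique index so the
// database itself rejects any future duplicate of this combination.
Schema::table('models', function (Blueprint $table) {
    $table->unique(['relation_id', 'foo', 'bar', 'timestamp'], 'models_dedupe_unique');
});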

2 Answers

You can allow the database to do the heavy lifting here: query it using GROUP BY, then remove everything that doesn't match your query.

$ids = Model::groupBy(['relation_id', 'foo', 'bar', 'timestamp'])
            ->pluck('id')
            ->all();

This translates to the following SQL:

SELECT id FROM models GROUP BY relation_id, foo, bar, timestamp;

So now $ids is an array of IDs, one per unique combination of those columns ([1, 4] in the example above). You can then execute the following to remove all other rows from the DB:

Model::whereNotIn('id', $ids)->delete();
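
As comes up in the comments below, MySQL 5.7's default ONLY_FULL_GROUP_BY sql_mode rejects that query with error 1055, because the bare id is neither grouped nor aggregated. Rather than disabling the mode, a sketch of a variant that aggregates the id instead (MIN(id) keeps the lowest id of each group, which matches the desired result in the question):

// Sketch: MIN(id) is a valid aggregate under ONLY_FULL_GROUP_BY and
// picks the first (lowest) id out of every duplicate group.
$ids = Model::selectRaw('MIN(id) AS id')
            ->groupBy(['relation_id', 'foo', 'bar', 'timestamp'])
            ->pluck('id')
            ->all();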

However, since $ids is probably huge, you are likely to hit the database's limit on placeholders per statement. In that case, you can try using array_chunk() to add multiple whereNotIn clauses to the query:

$query = Model::query();

foreach (array_chunk($ids, 500) as $chunk) {
    $query->whereNotIn('id', $chunk);
}

$query->delete();
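
Note that the chunked whereNotIn clauses above are still compiled into a single DELETE statement, so every id still contributes a placeholder and MySQL's "1390: too many placeholders" error (raised in the comments below) can persist. A possible workaround, sketched here rather than taken from the answer: hold the surviving ids in a lookup set and delete the rest in fixed-size batches, so no statement ever binds more than one batch of ids.

// Sketch: delete duplicates in batches of 1,000. Assumes $ids holds
// the keeper ids produced by the GROUP BY query above.
$keep = array_flip($ids); // O(1) membership checks

Model::select('id')->chunkById(1000, function ($rows) use ($keep) {
    $doomed = [];
    foreach ($rows as $row) {
        if (!isset($keep[$row->id])) {
            $doomed[] = $row->id;
        }
    }
    if ($doomed) {
        Model::whereIn('id', $doomed)->delete();
    }
});

chunkById is safe to use while deleting because it paginates on the id column rather than with offsets.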

I created an SQL Fiddle where you can test this out.

Vince
  • Thanks for the suggestion, I'm not sure that I can use that approach as I am not looking to get just distinct ids (or any unique single value). I am looking for a combination of values; where the `timestamp`, `foo` and `bar` are the same. But could I join these values into a unique string to create my identifier? – Sheixt Oct 21 '19 at 18:18
  • based on that suggestion, how would I go about creating a temporary join to create the unique identifier? – Sheixt Oct 21 '19 at 18:22
  • 1
    @Sheixt, You're right. See my updated answer, which leverages "GROUP BY" rather than "DISTINCT". This should work for you. I'd recommend testing on a smaller dummy dataset prior to running this on your actual DB. – Vince Oct 21 '19 at 18:52
  • thank you so much for the detailed answer! I'm sure this is the right way to go... I've created a slimmed-down version of the database to test upon and booted up tinker... when I run the first line I get the following error `Syntax error or access violation: 1055 Expression #1 of SELECT list is not in GROUP BY clause and contains nonaggregated column` – Sheixt Oct 21 '19 at 19:06
  • That usually means you have a typo in one of the column names inside the `groupBy([...])` statement. Make sure all of your column names are written ***exactly*** the way they are in the database, including case sensitivity. – Vince Oct 21 '19 at 19:09
  • Yeah, I've triple-checked the column names and they are correct in spelling and case sensitivity. What else could affect this? I'm running MySQL 5.7.24... Here is the full error `Syntax error or access violation: 1055 Expression #1 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'database.table.id' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by (SQL: select `id` from `table` group by `realation_id`, `foo`, `bar`, `timestamp`)'` – Sheixt Oct 21 '19 at 19:20
  • Aha! I forgot about that... Give [this a try](https://stackoverflow.com/a/54618070/3109473). – Vince Oct 21 '19 at 19:23
  • Particularly, you'd want to remove `'ONLY_FULL_GROUP_BY'` from the sql modes array. – Vince Oct 21 '19 at 19:26
  • Sorry I don't follow... If I do a direct copy of the modes into the mysql config it runs without the error... when you say partially what do you mean? – Sheixt Oct 21 '19 at 19:28
  • But I didn't say "partially"... I was just clarifying that the sql mode that causes the error you got is `ONLY_FULL_GROUP_BY`. Removing that from the modes array should solve the issue. Doesn't it? – Vince Oct 21 '19 at 19:30
  • 1
    Haha that would be why I'm confused...! I've been staring at code for too long! Sorry. I'm going to proceed with the test and will confirm but it looks good! – Sheixt Oct 21 '19 at 19:32
  • The logic is definitely sound. Unfortunately, I'm hitting a limit error `General error: 1390 Prepared statement contains too many placeholders` due to the table size. I'm looking to see if I can use `chunk()` to solve that... Progress though! – Sheixt Oct 21 '19 at 19:57
  • 1
    Added another update using [`array_chunk()`](https://www.php.net/manual/en/function.array-chunk.php). Let me know if this gets you past the 1390 error. – Vince Oct 21 '19 at 20:47
  • Yeah, I tried that myself, it gets past the 1390 error but now it's just emptying the entire table, which is baffling! – Sheixt Oct 21 '19 at 21:20
  • Oh just to clarify, I also had to typecast the `$ids` variable to `array` as it returns an object otherwise. – Sheixt Oct 21 '19 at 21:28
  • Actually, `$ids` is a [collection](https://laravel.com/docs/collections). Rather than using a typecast, which can yield unexpected results, you can call `->all()` on `$ids` to get the collection's underlying array. – Vince Oct 21 '19 at 21:32
  • Ah, I thought a collection was a type of object, my mistake. I've tried `foreach(array_chunk($ids->all(), 500) as $chunk)` but that returns the 1390 error again. – Sheixt Oct 21 '19 at 21:36
  • I've tried changing `->get(['id'])` to a plain array using `->pluck('id')` but that yields the same error. The `array_chunk()` seems to be handling the processing of the ids but the issue seems to lie in the final query that is processed after the `foreach`. Really I need to handle it in batches, but it needs to have the full array list to check against. I'm toying with the idea of adding a column to the database to create a flag for the unique references, then do a simple query delete on all of the non-flagged rows. This isn't ideal obviously... – Sheixt Oct 21 '19 at 21:52
  • Actually, that's not a bad idea. I'm assuming that de-duping the database is something you're doing that you don't expect to have to do again, right? If that's the case, adding a temporary column to mark unique rows is not a bad idea at all. – Vince Oct 21 '19 at 21:54
  • No, it's a tidy up job as a result of a faulty scheduled task. So should only need to run it once on this table and maybe once again on another but in theory, that'll be it. – Sheixt Oct 21 '19 at 21:59
  • I opted to use MySQL in the end. I've accepted your answer but have posted my own for reference. Thanks again for all your help. – Sheixt Oct 23 '19 at 10:59

For anyone experiencing a similar issue, I opted to use raw MySQL to handle the tidy-up of the database, as it was far more efficient than loading the records into memory using Eloquent.

This is the script I used:

DELETE table_name FROM table_name
LEFT OUTER JOIN (
    SELECT MIN(id) AS minID
    FROM table_name
    GROUP BY relation_id, timestamp
) AS keepRowTable ON table_name.id = keepRowTable.minID
WHERE keepRowTable.minID IS NULL
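
Before pointing a statement like this at a production table, a dry run is cheap insurance. This sketch counts the rows the DELETE above would remove, using the same join and grouping:

SELECT COUNT(*) AS rows_to_delete
FROM table_name
LEFT OUTER JOIN (
    SELECT MIN(id) AS minID
    FROM table_name
    GROUP BY relation_id, timestamp
) AS keepRowTable ON table_name.id = keepRowTable.minID
WHERE keepRowTable.minID IS NULL;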

I accepted @Vince's answer because his approach works within Laravel, but we ran into issues when processing such a large dataset. Plus he's a hero for being so responsive in the comments!

Sheixt