I have a large database table (~1 million records) from which I need to purge duplicate records. The table structure is as follows:
|----|-------------|-------|------|---------------------|
| id | relation_id | foo   | bar  | timestamp           |
|----|-------------|-------|------|---------------------|
| 1  | 1           | 14.20 | 0.22 | 2019-10-21 14:00:01 |
| 2  | 1           | 14.20 | 0.22 | 2019-10-21 14:00:01 |
| 3  | 1           | 14.20 | 0.22 | 2019-10-21 14:00:01 |
| 4  | 2           | 10.36 | 0.75 | 2019-10-21 14:00:01 |
| 5  | 2           | 10.36 | 0.75 | 2019-10-21 14:00:01 |
| 6  | 2           | 10.36 | 0.75 | 2019-10-21 14:00:01 |
|----|-------------|-------|------|---------------------|
As per the example above, there are a lot of records that have the exact same combination of values for relation_id, foo, bar and timestamp. I need to create a script that will identify the unique records and then delete any duplicates, so I would end up with something like:
|----|-------------|-------|------|---------------------|
| id | relation_id | foo   | bar  | timestamp           |
|----|-------------|-------|------|---------------------|
| 1  | 1           | 14.20 | 0.22 | 2019-10-21 14:00:01 |
| 4  | 2           | 10.36 | 0.75 | 2019-10-21 14:00:01 |
|----|-------------|-------|------|---------------------|
I have tested looping through each relation_id (as there are only 20 unique values) and then running something like this to create a collection of the unique records:
$unique = collect([]);

Model::where('relation_id', $relation_id)->chunk(100, function ($items) use ($unique) {
    // Add the de-duplicated rows from this chunk to the running collection
    $unique->push($items->unique()->values()->all());
});
From that, I had planned to loop through all of the Model records and delete any item that was not within the $unique collection. Something like this:
Model::chunk(100, function ($items) use ($unique) {
    foreach ($items as $item) {
        // Delete any record whose id is not in the collection of unique rows
        if (! $unique->contains('id', $item->id)) {
            $item->delete();
        }
    }
});
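I also wondered whether I could avoid building $unique in PHP at all and instead let the database pick one id per duplicate group, then bulk-delete everything else. This is only a rough, untested sketch of the idea, using the same Model and columns as above:

use Illuminate\Support\Facades\DB;

// Untested idea: ask the database for one id per unique combination of columns...
$keepIds = Model::select(DB::raw('MIN(id) AS id'))
    ->groupBy('relation_id', 'foo', 'bar', 'timestamp')
    ->pluck('id');

// ...then delete every row whose id is not in that set.
// With ~1 million rows this whereNotIn list could itself be very large,
// which is part of what I'm unsure about.
Model::whereNotIn('id', $keepIds)->delete();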
My problem is that, because the database table is so large, I cannot easily test whether any of this logic actually works. Running the first part of the above script (to populate $unique) for a single $relation_id ran in Tinker for 30 minutes without yielding a result.
I'm fairly confident that looping and chunking like this isn't the best way to delete duplicate records, since it requires multiple queries that I assume could be optimised (which is critical when dealing with such a large table).
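For example, I suspect a single self-join DELETE at the SQL level might do the whole job in one query, but I haven't tested it and I don't know how it behaves on a table this size. This sketch assumes MySQL, and models is just a placeholder for my real table name:

use Illuminate\Support\Facades\DB;

// Untested, MySQL-specific sketch: delete every row that has a duplicate
// with a lower id and the same relation_id, foo, bar and timestamp.
// 'models' is a placeholder for the actual table name.
DB::statement('
    DELETE t1 FROM models t1
    INNER JOIN models t2
        ON  t1.relation_id = t2.relation_id
        AND t1.foo = t2.foo
        AND t1.bar = t2.bar
        AND t1.timestamp = t2.timestamp
        AND t1.id > t2.id
');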
So what is the most efficient way to query a database table to check for unique records (based on multiple columns) and delete the duplicate records?