
I'm currently working on a Laravel 8 application, backed by a PostgreSQL database, in which I'm generating a Cost model for various items. My intention was to record at most one Cost->value per day, per item; however, due to some issues with overlapping jobs and the way I was using the updateOrCreate() method, I've ended up with multiple Cost records per day for each item.

I've since fixed the logic so that I'm no longer getting multiple records per day, but I'd like to now go back and clean up all of the duplicate records.

Is there an efficient way to delete all of the duplicate records per item, keeping only the newest record for each day, i.e. leaving no more than one record per item, per day? While I'm sure this seems pretty straightforward, I can't seem to land on the correct logic, either directly in SQL or through Laravel and PHP.

Possibly relevant: there are currently ~50k records in the table.

Example table

// Example database table migration
Schema::create('costs', function (Blueprint $table) {
    $table->id();
    $table->string('item');
    $table->decimal('value');
    $table->date('created_at');
    $table->timestamp('updated_at');
});

Rough Example (Before)

id,item,value,created_at,updated_at
510,item1,12,2021-07-02,2021-07-02 16:45:17
500,item1,13,2021-07-02,2021-07-02 16:45:05
490,item1,13,2021-07-02,2021-07-02 16:45:01
480,item2,12,2021-07-02,2021-07-02 16:44:59
470,item2,14,2021-07-02,2021-07-02 16:44:55
460,item2,12,2021-07-02,2021-07-02 16:44:54
450,item2,11,2021-07-02,2021-07-02 16:44:53

Rough Example (Desired End-State)

id,item,value,created_at,updated_at
510,item1,12,2021-07-02,2021-07-02 16:45:17
480,item2,12,2021-07-02,2021-07-02 16:44:59

2 Answers


You could use EXISTS():


select * from meuk;

DELETE FROM meuk d
WHERE EXISTS (
        SELECT * FROM meuk x
        WHERE x.item = d.item                           -- same item
        AND x.updated_at::date = d.updated_at::date     -- same date
        AND x.updated_at > d.updated_at                 -- but: more recent
        );

select * from meuk;

Results:


DROP TABLE
CREATE TABLE
COPY 7
VACUUM
 id  | item  | value | created_at |     updated_at      
-----+-------+-------+------------+---------------------
 510 | item1 |    12 | 2021-07-02 | 2021-07-02 16:45:17
 500 | item1 |    13 | 2021-07-02 | 2021-07-02 16:45:05
 490 | item1 |    13 | 2021-07-02 | 2021-07-02 16:45:01
 480 | item2 |    12 | 2021-07-02 | 2021-07-02 16:44:59
 470 | item2 |    14 | 2021-07-02 | 2021-07-02 16:44:55
 460 | item2 |    12 | 2021-07-02 | 2021-07-02 16:44:54
 450 | item2 |    11 | 2021-07-02 | 2021-07-02 16:44:53
(7 rows)

DELETE 5
 id  | item  | value | created_at |     updated_at      
-----+-------+-------+------------+---------------------
 510 | item1 |    12 | 2021-07-02 | 2021-07-02 16:45:17
 480 | item2 |    12 | 2021-07-02 | 2021-07-02 16:44:59
(2 rows)
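If you'd rather run that same statement from Laravel than from psql, a minimal sketch against the costs table from the question could look like the following (table and column names are assumed from the migration above; adjust them to your actual schema):

use Illuminate\Support\Facades\DB;

// Same EXISTS() delete, adapted from the demo table `meuk` to the
// question's `costs` table. Returns the number of rows removed.
$deleted = DB::delete("
    DELETE FROM costs d
    WHERE EXISTS (
        SELECT 1 FROM costs x
        WHERE x.item = d.item                           -- same item
          AND x.updated_at::date = d.updated_at::date   -- same date
          AND x.updated_at > d.updated_at               -- but: more recent
    )
");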

A different approach uses window functions. The idea is to number the records within each {item, day} group in descending order of updated_at, and preserve only the first one (rn = 1):


DELETE FROM meuk d
USING (
        SELECT item,updated_at
        , row_number() OVER (PARTITION BY item,updated_at::date 
                             ORDER BY item,updated_at DESC
                             ) rn
        FROM meuk x
        ) xx
WHERE xx.item = d.item
AND xx.updated_at = d.updated_at
AND xx.rn > 1
        ;

Do note that this procedure always involves a self-join: the fate of a record depends on the existence of other records in the same table.

wildplasser
  • That first option, using `EXISTS()` is exactly what I'm after. Thank you! – MrThePlague Jul 08 '21 at 15:18
  • The second one is *slightly* better, because it handles ties differently. – wildplasser Jul 08 '21 at 15:20
  • Would it make sense to change the query from `AND x.updated_at > d.updated_at` to `AND x.id > d.id` in order to account for ties in the `updated_at` column? – MrThePlague Jul 08 '21 at 18:33
  • Only if the `id` column has a meaningful order, which I don't know. In that case, it *could* act as a total ordering, or as a tiebreaker for updated_at. – wildplasser Jul 08 '21 at 18:52
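For reference, if the id column is assigned in insertion order, the tiebreaker discussed in the comments above could be folded into the EXISTS() version like this (a sketch, not part of the original answer; when two rows share the same updated_at, the one with the highest id survives):

use Illuminate\Support\Facades\DB;

// Assumes ids increase with insertion time, so they can break ties between
// rows that share the exact same updated_at value.
DB::delete("
    DELETE FROM costs d
    WHERE EXISTS (
        SELECT 1 FROM costs x
        WHERE x.item = d.item
          AND x.updated_at::date = d.updated_at::date
          AND (x.updated_at > d.updated_at
               OR (x.updated_at = d.updated_at AND x.id > d.id))
    )
");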

There's a hairy SQL query here: https://stackoverflow.com/a/1313293/1346367; the simpler one is based on joining the table with itself on costs1.id < costs2.id. Whether you use < or > determines whether you keep the oldest or the newest row. Sadly, there isn't an easier way (you can't rely on an ORDER BY inside a GROUP BY statement, if I recall correctly).

Since I can't explain in detail how that query works, I'll give you a Laravel/PHP solution instead, which is inefficient but comprehensible:

$keepIds = [];

// Loop over the table (without Eloquent, for a performance benefit), oldest id first.
foreach (DB::table('costs')->orderBy('id', 'ASC')->get() as $cost) {
    // Key by item *and* day; each overwrite leaves the highest (newest) id
    // for that item/day, which is the record to keep.
    $keepIds[$cost->item . '|' . $cost->created_at] = $cost->id;
}

// Remove every record that is not in the keep list.
DB::table('costs')->whereNotIn('id', array_values($keepIds))->delete();

I'm not sure that last query will work properly with a very big array, though; with tens of thousands of ids bound into a single NOT IN list it might hit a parameter or query-size limit.
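One way around that (a sketch, untested against the original data) is to read the ids in chunks and delete in batches, so that no single query has to carry tens of thousands of bindings:

use Illuminate\Support\Facades\DB;

$keepIds = [];
$allIds  = [];

// Walk the table 1000 rows at a time, in ascending id order.
DB::table('costs')
    ->select('id', 'item', 'created_at')
    ->chunkById(1000, function ($costs) use (&$keepIds, &$allIds) {
        foreach ($costs as $cost) {
            $allIds[] = $cost->id;
            // Key by item *and* day; the last (highest) id per group wins.
            $keepIds[$cost->item . '|' . $cost->created_at] = $cost->id;
        }
    });

// Everything that is not a keeper gets deleted, 1000 ids per query.
$deleteIds = array_diff($allIds, $keepIds);
foreach (array_chunk($deleteIds, 1000) as $chunk) {
    DB::table('costs')->whereIn('id', $chunk)->delete();
}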

Note that you can play with the orderBy to choose whether you want to keep the newest or the oldest records.

Flame