0

My table structure is as described in this post:

 name | version | processed | processing | updated  | ref_time 
------+---------+-----------+------------+----------+----------
 abc  |       1 | t         | f          | 27794395 | 27794160
 def  |       1 | t         | f          | 27794395 | 27793440
 ghi  |       1 | t         | f          | 27794395 | 27793440
 jkl  |       1 | f         | f          | 27794395 | 27794160
 mno  |       1 | t         | f          | 27794395 | 27793440
 pqr  |       1 | f         | t          | 27794395 | 27794160

Based on this answer, I am deriving a list of ref_time values which I want to use as a basis for deleting 'old' entries from status_table.

This is the query to generate the list of relevant ref_time values:

WITH main AS
(
    SELECT ref_time,
        ROUND(AVG(processed::int) * 100, 1) percent
    FROM status_table
    GROUP BY ref_time ORDER BY ref_time DESC, percent DESC
)
SELECT ref_time FROM main WHERE percent=100 OFFSET 2;

For example this might return:

 ref_time 
----------
 27794880
 27794160

I can then use this to DELETE all relevant entries in the status_table:

DELETE FROM status_table
WHERE ref_time IN 
(
    WITH main AS
    (
        SELECT ref_time,
            ROUND(AVG(processed::int) * 100, 1) percent
        FROM status_table
        GROUP BY ref_time ORDER BY ref_time DESC, percent DESC
    )
    SELECT ref_time FROM main WHERE percent=100 OFFSET 2
);

But I have another table named data_table, which also has a ref_time column, and I want to DELETE entries from that table on the same basis, i.e. any rows having ref_time in the above list.

How do I achieve this without duplicating the query used to generate the ref_time list?

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
drmrbrewer
  • 11,491
  • 21
  • 85
  • 181

1 Answers1

2

You can use common table expressions:

with 
    ref as (
        select ref_time 
        from status_table 
        group by ref_time 
        having bool_and(processed)
        order by ref_time desc limit 2
    ),
    del_ref as (
        delete from status_table s
        using ref r
        where s.ref_time = r.ref_time
    )
delete from data_table d
using ref r
where d.ref_time = r.ref_time

The first CTE,ref, returns the list of timestamps that you want to delete from the two other tables. I attempted to simplify the logic: you seem to want the top 2 timestamps that are fully processed (note that offset skips that many rows from the resultset, which is different than limit).

The second CTE deletes from status_table, and the last part of the query addresses data_table.

GMB
  • 216,147
  • 25
  • 84
  • 135
  • This is really useful, thanks. Regarding `limit` vs `offset` I think it is `offset` that I need, because I want to delete anything *other than* the most recent two fully processed batches... hence `offset` to skip over the top two (the ones I want to keep) when forming the list for `DELETE`... but no matter... it's the overall structure that matters, and it's really useful to see how it can be done. Out of interest, you seem to prefer putting everything in lower-case, even internal functions... is that personal preference or a recent trend? Thanks! – drmrbrewer Nov 06 '22 at 20:44
  • And it seems that `del_ref` is defined but not used? – drmrbrewer Nov 06 '22 at 20:46
  • I created a demo based on your answer here: https://dbfiddle.uk/mJc0HLNT (keeping only the top one fully processed batch). – drmrbrewer Nov 06 '22 at 21:06
  • 1
    @drmrbrewer: ok, so you want `offset` indeed. As for the lower case, that's just personal taste, and there is no "standard" here. `del_ref` is a CTE, so it needs to be named - but indeed it is not used afterwards. – GMB Nov 06 '22 at 21:09
  • So is `del_ref` defined as a CTE (but not used) so that you can have two `delete` in the same expression? – drmrbrewer Nov 06 '22 at 21:27
  • 1
    @drmrbrewer: yes, exactly. – GMB Nov 06 '22 at 21:34
  • Thanks. Although it works, what I don't understand is why `where s.ref_time = r.ref_time` is valid, since `r.ref_time` here is not just a single value, but (possibly) many values... i.e. it's not the same as just `where s.ref_time = 27794160` which is what I'm more used to. Is there a reference for this type of expression, so that I can read more about it? – drmrbrewer Nov 06 '22 at 23:03
  • 1
    @drmrbrewer: yes, `r` is a dataset, not a single record; the `delete ... using ...` syntax is really like a `join`, in essence. – GMB Nov 06 '22 at 23:08
  • Now I'd like to be a bit more sophisticated about how many records to delete... I've created a new question here: https://stackoverflow.com/q/74349852/4070848 – drmrbrewer Nov 07 '22 at 16:31