
My table structure is roughly as described in this post:

name | processed | processing | updated  | ref_time
-----+-----------+------------+----------+----------
abc  | t         | f          | 27794395 | 27794160
def  | t         | f          | 27794395 | 27793440
ghi  | t         | f          | 27794395 | 27793440
jkl  | f         | f          | 27794395 | 27794160
mno  | t         | f          | 27794395 | 27793440
pqr  | f         | t          | 27794395 | 27794160

I created a dbfiddle already based on this table structure (more on this below), so there is no need to create your own.

Based on this answer, I am deriving a list of ref_time values to use as a basis for deleting 'old' entries from status_table:

with 
    ref as (
        select ref_time 
        from status_table 
        group by ref_time 
        having bool_and(processed)
        order by ref_time desc
        offset 1
    )
delete from status_table s
using ref r
where s.ref_time = r.ref_time

But now I want to be more sophisticated about what I use as the offset... Ideally I would keep only the most recent ref_time for which all records are processed (as in the above example, where the offset is 1), but keep the two most recent ref_time values when the second has more associated records than the first (i.e. the offset needs to be 2 to skip over the two most recent ref_time values).

I figure that the following query (based on this answer) will help in this task, because it counts the total number of processed records based on ref_time:

select ref_time, 
    count(*) cnt_total,
    count(*) filter(where processed) cnt_processed,
    round(avg(processed::int),2) ratio_processed
from status_table
group by ref_time
order by ratio_processed desc, ref_time desc;
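
For the six sample rows shown at the top, this query would return something like the following (the linked dbfiddle contains a larger data set):

ref_time | cnt_total | cnt_processed | ratio_processed
---------+-----------+---------------+-----------------
27793440 |     3     |       3       |      1.00
27794160 |     3     |       1       |      0.33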

So in this dbfiddle I'd need to preserve ref_time=27794160 (rather than include it in the delete list as is the case in the example) because, although it is second, it also has a higher cnt_total than the first.

In general, the rule is that I want to keep all ref_time values up to (but not including) the first ref_time whose cnt_total is the same as (or less than) that of the one before it.
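
As an illustration of the rule (the cnt_total figures here are hypothetical; the ref_time values are taken from the dbfiddle):

ref_time | cnt_total | action
---------+-----------+-------------------------------------------------------------
27794170 |     3     | keep (most recent fully-processed ref_time)
27794160 |     5     | keep (higher cnt_total than the one before)
27793450 |     4     | delete (not higher than the one before), plus everything older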

Erwin Brandstetter
drmrbrewer
  • Can you edit the second last paragraph to make it more obvious you've included a dbfiddle please? I just went off and created my own without realising. – Jim Jimson Nov 10 '22 at 06:24
  • In the last example you provide, you have 4 ref_time groups (27794170, 27794160, 27793450, 27793440). Your goal is to delete everything after 27794170 except in the case where the record before has a higher count (in this case 27794160). What happens if 27793450 has a higher count than 27794160? – Jim Jimson Nov 10 '22 at 06:31
  • @JimJimson you asked "What happens if 27793450 has a higher count than 27794160?". As per the rule in the final paragraph, if that's the case then we should keep 27793450... but bear in mind that because of the `having bool_and(processed)` in the `delete` query, we're only deleting if all records are `processed`... which isn't the case in this example for 27793450... so we'd keep it regardless of whether or not it has a higher count than 27794160. – drmrbrewer Nov 10 '22 at 07:53

2 Answers

WITH sel AS (
   SELECT ref_time
   FROM  (
      SELECT ref_time
           , count(*) FILTER (WHERE drop) OVER (ORDER BY ref_time DESC) AS drops
      FROM  (
         SELECT ref_time
              , lag(count(*)) OVER (ORDER BY ref_time DESC) >= count(*) IS TRUE AS drop
         FROM   status_table
         GROUP  BY ref_time
         HAVING bool_and(processed)
         ) sub1
      ) sub2
   WHERE drops > 0
   )
DELETE FROM status_table d
USING  sel s
WHERE  d.ref_time = s.ref_time;

fiddle

Subquery sub1 is mostly what you already had. Plus some more:

We only need count(*) since HAVING bool_and(processed) excludes other cases anyway.

lag(count(*)) OVER (ORDER BY ref_time DESC) >= count(*) checks whether the count of the previous row (the next more recent ref_time) is the same or greater; in that case we call the current row a "drop". And we want to keep all rows before the first "drop".

So count(*) FILTER (WHERE drop) OVER (ORDER BY ref_time DESC) AS drops in sub2 counts those "drops" cumulatively, and WHERE drops > 0 eliminates the leading rows we want to preserve, leaving only the rows at or after the first "drop" in the delete set.
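
To see how these intermediate columns behave before deleting anything, the inner subqueries can be run on their own; a minimal inspection sketch (same logic as sub1 / sub2 above, just as a plain SELECT):

-- Inspection only: show the per-ref_time count, the "drop" flag and the
-- running "drops" tally that the DELETE above bases its decision on.
SELECT ref_time, cnt, drop
     , count(*) FILTER (WHERE drop) OVER (ORDER BY ref_time DESC) AS drops
FROM  (
   SELECT ref_time
        , count(*) AS cnt
        , lag(count(*)) OVER (ORDER BY ref_time DESC) >= count(*) IS TRUE AS drop
   FROM   status_table
   GROUP  BY ref_time
   HAVING bool_and(processed)
   ) sub1
ORDER  BY ref_time DESC;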

About this "gaps-and-islands" approach:

If there can be concurrent writes on status_table, you may need to add the locking clause FOR UPDATE to be sure. But you cannot do that in an aggregate query, so you would add another subquery where you take the row locks first.
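
A minimal sketch of what that could look like (my own assumption, not part of the original answer): lock the candidate rows in an inner subquery, because FOR UPDATE cannot be combined with GROUP BY / aggregation.

-- Sketch only: take row locks in the inner subquery, then aggregate.
SELECT ref_time
FROM  (
   SELECT ref_time, processed
   FROM   status_table
   FOR    UPDATE          -- row locks taken here, before aggregation
   ) locked
GROUP  BY ref_time
HAVING bool_and(processed);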

Erwin Brandstetter
  • This is great... so many learning points! As a minor bolt-on, I have a separate DELETE query I run which clears out "old" data based on the `updated` timestamp (hence the reason for that column)... basically I want to keep any `ref_time` that isn't all `processed` (on the basis that it may still be "in process"... hence filtering those out of the existing DELETE list)... but after some time I can assume that the `ref_time` will never all be `processed` and just delete everything associated with it... – drmrbrewer Nov 10 '22 at 12:13
  • I have created a fiddle which I think works, to combine my `updated` based DELETE into this one, but I can't help feeling that it's not optimal... what do you think? https://dbfiddle.uk/z9I7qqeV – drmrbrewer Nov 10 '22 at 12:13
  • The only changes compared to yours are any instances of `max(updated)` or `max_updated` in the queries. I don't like the repetition of the threshold `27794395` (prone to bugs when updating one but not another) but since this will likely be in a bash script I can use a variable for that... but perhaps the SQL itself could be tidied up to avoid this repetition? – drmrbrewer Nov 10 '22 at 12:16
  • Look mostly good to me. Some suggestions: https://dbfiddle.uk/EZOtM6po Or start a new question. Comments are not the place ... – Erwin Brandstetter Nov 10 '22 at 12:20
  • I could ask a new question but I'm not sure what question to ask, since it's so specific. The problem I see with your cut-down version is that the "old" (but not fully `processed`) data could mess up the `drops` sequence because they might bubble up into that sequence... hence in my fiddle above I created an `old` column and used ORDER DESC to drop those down to the bottom, out of the way of the `drops` sequence (which should only apply to fully `processed` data). See the "old" entries with `ref_time=27794165` (not all `processed`) in this fiddle: https://dbfiddle.uk/b0xzz-3t – drmrbrewer Nov 10 '22 at 12:41
  • @drmrbrewer: Sounds like you should just run a simple, separate `DELETE` for the "old" ones. – Erwin Brandstetter Nov 10 '22 at 12:46
  • OK, seems like good pragmatic advice... this is basically what I'm doing already (i.e. running a separate `DELETE`) and it sounds like it's over-complicating things to try and amalgamate with this one. – drmrbrewer Nov 10 '22 at 13:41

Here's a simple translation into window functions:

with 
 count_per_completed_ref_time as
  ( select 
        ref_time, 
        count(*) cnt_total
    from  status_table 
    group by ref_time 
    having bool_and(processed)
    order by ref_time desc )
,windowed_counts as
  ( select 
        ref_time,
        cnt_total,
        row_number() over w1 as ref_time_num,
        lag(cnt_total) over w1 as preceding_cnt_total
    from count_per_completed_ref_time
    window w1 as (order by ref_time desc) )
delete from status_table s
where ref_time in --only delete completed ref_times, found in the first CTE
  ( select ref_time from count_per_completed_ref_time)
and ref_time not in --prevent deleting these
  ( select ref_time
    from windowed_counts
    where ref_time_num = 1 --top, latest completed ref_time
    or (  ref_time_num = 2 --second latest
        and cnt_total>preceding_cnt_total)--has higher total than the latest
   ) 
returning *;
  1. Selects your ref_time groups the same as before, keeping only the ones that are fully processed.
  2. Uses a window to number these ref_time values in descending order and to fetch the cnt_total of the preceding ref_time, the one ranked just above.
  3. Always picks the top ref_time in this ranking, and if the second one's cnt_total is higher than the first's, picks it too.
  4. Deletes all completed ref_time's found in 1., unless they were picked in 3., returning everything it deleted for you to inspect (a preview sketch follows below).
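
If you prefer to check the decision before deleting anything, the same CTEs can be reused in a plain SELECT; a sketch (my own preview query, not part of the answer's demo):

with 
 count_per_completed_ref_time as
  ( select ref_time, count(*) cnt_total
    from status_table 
    group by ref_time 
    having bool_and(processed) )
,windowed_counts as
  ( select ref_time, cnt_total,
        row_number() over w1 as ref_time_num,
        lag(cnt_total) over w1 as preceding_cnt_total
    from count_per_completed_ref_time
    window w1 as (order by ref_time desc) )
select ref_time --the ref_times the DELETE above would preserve
from windowed_counts
where ref_time_num = 1 --latest completed ref_time
   or (ref_time_num = 2 --second latest, with a higher total
       and cnt_total > preceding_cnt_total);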

Demo with additional test cases.

Zegarek