My table structure is roughly as described in this post:
name | processed | processing | updated | ref_time |
---|---|---|---|---|
abc | t | f | 27794395 | 27794160 |
def | t | f | 27794395 | 27793440 |
ghi | t | f | 27794395 | 27793440 |
jkl | f | f | 27794395 | 27794160 |
mno | t | f | 27794395 | 27793440 |
pqr | f | t | 27794395 | 27794160 |
I created a dbfiddle already based on this table structure (more on this below), so there is no need to create your own.
Based on this answer, I am deriving a list of ref_time
values to use as a basis for deleting 'old' entries from status_table
:
with
ref as (
select ref_time
from status_table
group by ref_time
having bool_and(processed)
order by ref_time desc
offset 1
)
delete from status_table s
using ref r
where s.ref_time = r.ref_time
But now I want to be more sophisticated about what I use as the offset
... I would ideally like to keep the most recent ref_time
for which all records are processed (as per the above example where offset
is 1
), but the two most recent ref_time
where the second ref_time
has more associated records than the first (i.e. offset
needs to be 2
to skip over the two most recent ref_time
).
I figure that the following query (based on this answer) will help in this task, because it counts the total number of processed
records based on ref_time
:
select ref_time,
count(*) cnt_total,
count(*) filter(where processed) cnt_processed,
round(avg(processed::int),2) ratio_processed
from status_table
group by ref_time
order by ratio_processed desc, ref_time desc;
So in this dbfiddle I'd need to preserve ref_time=27794160
(rather than include it in the delete list as is the case in the example) because, although it is second, it also has a higher cnt_total
than the first.
In general, the rule is that I want to keep all ref_time
up to (but not including) the ref_time
having the same cnt_total
as the one before (or less).