
I have a dbfiddle demo with the following tables defined:

CREATE TABLE status_table (
  base_name  text      NOT NULL
, version    smallint  NOT NULL
, ref_time   int       NOT NULL
, processed  bool      NOT NULL
, processing bool      NOT NULL
, updated    int       NOT NULL
, PRIMARY KEY (base_name, version)
);

CREATE TABLE data_table (
  location  text      NOT NULL
, param_id  text      NOT NULL
, ref_time  int       NOT NULL
, fcst_time smallint  NOT NULL
, timestamp int       NOT NULL
, value     text      NOT NULL
, PRIMARY KEY (location, param_id, ref_time, fcst_time)
);

There are no other indexes defined.

Note that for each row in data_table, it is the case that ref_time + fcst_time = timestamp. I know that's redundant, but it's the way the schema has evolved. So ref_time is a base time for a batch of data, and fcst_time is an offset from it, giving the actual timestamp of each data record: for each batch there is a timeseries of records starting at ref_time, with a progressively increasing timestamp (i.e. fcst_time) and a single data value per record.
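As an illustration, a quick sanity check for that invariant (hypothetical, not part of the schema):

SELECT count(*) AS violations   -- expect 0 if the invariant holds
FROM data_table
WHERE ref_time + fcst_time <> "timestamp";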

I then have the following complicated query for deleting selected rows from data_table. It pulls some stats from status_table and joins them onto data_table, selects the rows that should not be deleted (sel1 and sel2), and then deletes all rows of data_table that are in neither sel1 nor sel2.

As an aside, sel1 essentially corresponds to my query for reading data from data_table (though I restrict to a particular location when doing so, which makes it quite quick). So sel1 is the set of rows that might be returned by a read query; I want to keep those and not delete them.

Then sel2 contains the rows relating to data that is still being processed, so I need to keep those too.

So with that in mind, here is the query:

WITH
  stats AS (
    SELECT ref_time
      , max(updated) < (round(extract(epoch from now()) / 60) - 200) AS settled
      , (count(*) FILTER (WHERE processed) = count(*)) AND (max(updated) < (round(extract(epoch from now()) / 60) - 200)) AS ready
    FROM status_table
    GROUP BY ref_time
  ),
  min_ts AS (
    SELECT ref_time FROM stats WHERE ready ORDER BY ref_time DESC LIMIT 1
  ),
  sel1 AS (
    -- we need to keep all of these rows (don't delete)
    SELECT DISTINCT ON (d.location, d.timestamp, d.param_id)
      d.location, d.param_id, d.ref_time, d.fcst_time
    FROM data_table AS d
    INNER JOIN stats s USING (ref_time)
    WHERE s.ready AND d.timestamp >= (SELECT ref_time FROM min_ts)
    ORDER BY d.location, d.timestamp, d.param_id, d.ref_time DESC
  ),
  sel2 AS (
    -- we also need to keep all of these rows (don't delete)
    SELECT
      d.location, d.param_id, d.ref_time, d.fcst_time
    FROM data_table AS d
    INNER JOIN stats AS s USING (ref_time)
    WHERE NOT s.settled
  )
DELETE FROM data_table 
  WHERE
    (location, param_id, ref_time, fcst_time) NOT IN (SELECT location, param_id, ref_time, fcst_time FROM sel1)
  AND
    (location, param_id, ref_time, fcst_time) NOT IN (SELECT location, param_id, ref_time, fcst_time FROM sel2);

But I'm finding that this is horribly slow on my actual database. I know that I need to optimise my indexes and possibly the primary keys, and I have tried various things without any real success, so I'm a bit lost.

Here is the output of an EXPLAIN for the above query on my actual database:

                                                       QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
 Delete on data_table  (cost=4002975.62..118180240066541.86 rows=0 width=0)
   CTE stats
     ->  HashAggregate  (cost=234.02..234.21 rows=4 width=6)
           Group Key: status_table.ref_time
           ->  Seq Scan on status_table  (cost=0.00..164.01 rows=7001 width=9)
   ->  Seq Scan on data_table  (cost=4002741.41..118180240066307.66 rows=19567628 width=6)
         Filter: ((NOT (SubPlan 3)) AND (NOT (SubPlan 4)))
         SubPlan 3
           ->  Materialize  (cost=4002741.30..4293628.93 rows=7691318 width=18)
                 ->  Subquery Scan on sel1  (cost=4002741.30..4210105.34 rows=7691318 width=18)
                       ->  Unique  (cost=4002741.30..4133192.16 rows=7691318 width=22)
                             InitPlan 2 (returns $1)
                               ->  Limit  (cost=0.09..0.09 rows=1 width=4)
                                     ->  Sort  (cost=0.09..0.10 rows=2 width=4)
                                           Sort Key: stats.ref_time DESC
                                           ->  CTE Scan on stats  (cost=0.00..0.08 rows=2 width=4)
                                                 Filter: ready
                             ->  Sort  (cost=4002741.20..4035353.91 rows=13045086 width=22)
                                   Sort Key: d.location, d."timestamp", d.param_id, d.ref_time DESC
                                   ->  Hash Join  (cost=0.11..1925948.51 rows=13045086 width=22)
                                         Hash Cond: (d.ref_time = s.ref_time)
                                         ->  Seq Scan on data_table d  (cost=0.00..1697659.40 rows=26090171 width=22)
                                               Filter: ("timestamp" >= $1)
                                         ->  Hash  (cost=0.08..0.08 rows=2 width=4)
                                               ->  CTE Scan on stats s  (cost=0.00..0.08 rows=2 width=4)
                                                     Filter: ready
         SubPlan 4
           ->  Materialize  (cost=0.11..2611835.48 rows=39135256 width=18)
                 ->  Hash Join  (cost=0.11..2186850.21 rows=39135256 width=18)
                       Hash Cond: (d_1.ref_time = s_1.ref_time)
                       ->  Seq Scan on data_table d_1  (cost=0.00..1501983.12 rows=78270512 width=18)
                       ->  Hash  (cost=0.08..0.08 rows=2 width=4)
                             ->  CTE Scan on stats s_1  (cost=0.00..0.08 rows=2 width=4)
                                   Filter: (NOT settled)
 JIT:
   Functions: 45
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(37 rows)
drmrbrewer
    You may find it helpful to write a query creating a temporary table containing the rows to save. That's a way to optimize your (intricate) row-selection logic nondestructively, CTE by CTE and subquery by subquery. – O. Jones Nov 20 '22 at 12:14
  • @O.Jones isn't that effectively what `sel1` and `sel2` are... temporary tables containing rows to save? – drmrbrewer Nov 20 '22 at 13:02
  • I actually mean CREATE TEMPORARY TABLE .... That way you can test stuff without actually DELETEing anything. – O. Jones Nov 20 '22 at 14:56
  • With actual temporary tables you can create them and see how many rows they contain. With CTEs in a giant non-runnable statement, you can't see how many rows they actually produce, only how many the system thought they would produce. – jjanes Nov 20 '22 at 14:58
  • You could increase work_mem until the 'NOT (SubPlan N)' turns into 'NOT (hashed SubPlan N)'. But that might not be feasible to do if your RAM is too small. – jjanes Nov 20 '22 at 15:04
  • Before anything else I would create an index on the ref_time columns in both tables, because these columns are used in the JOIN conditions, the GROUP BY and the ORDER BY. – Frank Heikens Nov 20 '22 at 15:44
  • @FrankHeikens the output from `EXPLAIN` is completely unchanged from what is in my question, after creating an index on `ref_time` in both tables :-/ – drmrbrewer Nov 20 '22 at 16:27
  • I also tried adding indexes in `data_table` on `(location, timestamp, param_id, ref_time DESC)` and `(timestamp)` but still no change... perhaps something else is preventing use of these indexes? – drmrbrewer Nov 20 '22 at 16:42
  • @O.Jones ah OK, something [like this](https://dbfiddle.uk/kzw-vNCi)... just replacing each CTE with a temporary table. – drmrbrewer Nov 20 '22 at 17:25
  • No `EXPLAIN (ANALYZE, BUFFERS)` output. – Laurenz Albe Nov 20 '22 at 19:03
  • @LaurenzAlbe I've tried that but seriously it just takes so long that I've just aborted before the results are shown... there must be something really glaringly obviously wrong with the indexes or query, or both, as to why the performance is so bad? Perhaps it's just the wrong approach to be finding rows *not* to delete rather than finding rows *to* delete? – drmrbrewer Nov 20 '22 at 19:13
  • Have you tried saving `stats` as a regular table beforehand, `create index on stats (ref_time DESC nulls last);` plus `create index on data_table (location, timestamp, param_id, ref_time DESC nulls last) include (fcst_time);`, plus a [union of your `not in` lists](https://stackoverflow.com/a/74530698/5298879)? – Zegarek Nov 22 '22 at 12:47
  • @Zegarek there are only 4 rows in the `stats` table so I doubt that indexing it will make any difference? Even with that extra index (but still using a CTE for `stats`) the `EXPLAIN` seems horrible: https://pastebin.com/sU0EiYV6 – drmrbrewer Nov 22 '22 at 19:35
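A minimal sketch of the temporary-table debugging approach suggested in the comments above (assuming the same stats logic; tmp_stats and the work_mem value are illustrative, not from the original post):

-- materialize the stats CTE as a real table so its row counts can be inspected
CREATE TEMPORARY TABLE tmp_stats AS
SELECT ref_time
  , max(updated) < (round(extract(epoch from now()) / 60) - 200) AS settled
  , (count(*) FILTER (WHERE processed) = count(*))
    AND (max(updated) < (round(extract(epoch from now()) / 60) - 200)) AS ready
FROM status_table
GROUP BY ref_time;

-- actual row counts, not planner estimates
SELECT count(*) FROM tmp_stats WHERE ready;
SELECT count(*) FROM tmp_stats WHERE NOT settled;

-- per the work_mem suggestion: more memory may turn 'NOT (SubPlan N)'
-- into 'NOT (hashed SubPlan N)' in the plan; 256MB is an arbitrary example value
SET work_mem = '256MB';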

1 Answer


Does this improve your explain plan?

The union removes the AND of two NOT IN checks in your delete, leaving a single subquery to test against.

WITH
  stats AS (
    SELECT ref_time
      , max(updated) < (round(extract(epoch from now()) / 60) - 200) AS settled
      , (count(*) FILTER (WHERE processed) = count(*)) AND (max(updated) < (round(extract(epoch from now()) / 60) - 200)) AS ready
    FROM status_table
    GROUP BY ref_time
  ),
  min_ts AS (
    SELECT ref_time FROM stats WHERE ready ORDER BY ref_time DESC LIMIT 1
  ),
  sel1 AS (
    -- records that would be selected by an actual data lookup (use same logic)... we need to keep these (don't delete)
    SELECT DISTINCT ON (d.location, d.timestamp, d.param_id)
      d.location, d.param_id, d.ref_time, d.fcst_time
    FROM data_table AS d
    INNER JOIN stats s USING (ref_time)
    WHERE s.ready AND d.timestamp >= (SELECT ref_time FROM min_ts)
    ORDER BY d.location, d.timestamp, d.param_id, d.ref_time DESC
  ),
  sel2 AS (
    -- also keep all records that are in-progress (not 'settled')
    SELECT
      d.location, d.param_id, d.ref_time, d.fcst_time
    FROM data_table AS d
    INNER JOIN stats AS s USING (ref_time)
    WHERE NOT s.settled
  ),
  sel AS (
    SELECT * FROM sel1
    UNION SELECT * FROM sel2
  )
DELETE FROM data_table 
  WHERE
    (location, param_id, ref_time, fcst_time) NOT IN (SELECT location, param_id, ref_time, fcst_time FROM sel);
Pompedup
  • I just updated the query to create a new temporary table; the syntax problem was the `order by`, because with a `union` it has to come at the end of the last arm – Pompedup Nov 20 '22 at 14:59
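For reference, a self-contained illustration of the order by placement issue mentioned in that comment (hypothetical example, not from the original query):

-- syntax error: a bare ORDER BY on the first arm of a UNION
-- SELECT 1 AS n ORDER BY n UNION SELECT 2 AS n;

-- allowed: the ORDER BY follows the last arm and sorts the whole result
SELECT 1 AS n
UNION
SELECT 2 AS n
ORDER BY n;

-- an arm may keep its own ORDER BY only if it is parenthesized:
-- (SELECT 1 AS n ORDER BY n) UNION (SELECT 2 AS n ORDER BY n);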