
In this dbfiddle demo I have a DELETE FROM... WHERE at the end like so:

    ......
    DELETE FROM data_table 
      WHERE
        (location, param_id, ref_time, fcst_time) NOT IN (SELECT location, param_id, ref_time, fcst_time FROM sel1)
      AND
        (location, param_id, ref_time, fcst_time) NOT IN (SELECT location, param_id, ref_time, fcst_time FROM sel2);

Although it works, it seems unnecessarily wordy, and possibly also not optimal in terms of performance?

Is there any way in which this can be simplified, e.g. by using a single NOT IN clause?

Because of the way that they are defined, there is no overlap/intersection between sel1 and sel2.

  • status_table has 8,033 rows
  • data_table has 116,432,724 rows
  • sel1 has 61,860,084 rows
  • sel2 has 53,706,188 rows

Numbers don't necessarily add up because it's a live database and data is going in all the time. And yes, at the moment there are few if any rows to delete because most/all of the data is current.

drmrbrewer
    You can leave just one `not in` against a [`union`](https://www.postgresql.org/docs/current/queries-union.html) of these selects. As an added bonus, union will deduplicate them. `delete from t1 where col not in (select col from t2 union select col from t3);` – Zegarek Nov 19 '22 at 12:41
  • I would first raise my doubts that it actually works as expected. `NOT IN` on a couple of plain columns is tricky matter. Are the table definitions in your fiddle real? All data types and constraints as displayed? All columns `NOT NULL`? Your Postgres version? To optimize performance, we also need to know cardinalities and available indexes. – Erwin Brandstetter Nov 19 '22 at 12:58
  • @ErwinBrandstetter the table definitions in the fiddle are the same as my real table definitions, and the same postgres version (15.1). – drmrbrewer Nov 19 '22 at 13:01
  • @Zegarek does the added bonus of deduplicating them come at the cost of a slower query? – drmrbrewer Nov 19 '22 at 15:24
  • @drmrbrewer [It does seem like it](https://dbfiddle.uk/ERe0Jy13), but only in the sense that deduplication adds a bit of cost up front, which is later made up for when `not in` has to do the comparison. – Zegarek Nov 19 '22 at 15:32
  • @Zegarek when I run your demo I make it that the `not in... union` query is slightly faster than the `not in and not in` query? – drmrbrewer Nov 19 '22 at 15:41
  • @drmrbrewer Correct, but I wouldn't rely on a single, random example to formulate a rule. [There](https://dbfiddle.uk/ERe0Jy13), `not in and not in` is the slowest, `not in union all` comes second, `not in union` is the quickest. [Here](https://dbfiddle.uk/q6g0JqLq) and [here](https://dbfiddle.uk/7q9JI7Po), the unions switched places. On my machine, with 50 million rows, both unions take around 32s, while the initial version takes 37s. – Zegarek Nov 19 '22 at 15:48
  • By "cardinality" I mean the rough number of rows in each table and the rough number of rows that overlap between `sel1` and `sel2`, as well as the rough number of rows in `data_table` that will be deleted. Edit that information into your question, please. – Erwin Brandstetter Nov 19 '22 at 15:48
  • @Zegarek Although actually in my case there is no overlap/intersection between `sel1` and `sel2` and therefore no deduplicating to do. This is because `sel1` requires `s.ready` while `sel2` requires `NOT s.settled`... if it is `NOT s.settled` then it is also `NOT s.ready` just because of how these are defined. – drmrbrewer Nov 19 '22 at 15:50
  • @ErwinBrandstetter looks like I answered your point about overlap with my reply to @Zegarek... there is no overlap/intersection between `sel1` and `sel2`. – drmrbrewer Nov 19 '22 at 15:53
  • Good. What about the rest of my questions? – Erwin Brandstetter Nov 19 '22 at 15:57
  • I've edited the question to add row count for the two tables... I ran a query to try and count the number of rows that will be deleted, but I aborted it after it was running for an hour :-0 ... this query may need some serious optimisation, or a complete rethink. – drmrbrewer Nov 19 '22 at 16:48
  • After a hard look at your fiddle, it's safe to say that the query you show in the question is only the tip of the iceberg, and the main issues are hidden below. – Erwin Brandstetter Nov 19 '22 at 17:22
  • @ErwinBrandstetter any hints about where the main issues are? – drmrbrewer Nov 19 '22 at 17:25
  • Wasn't able to get the complete picture. But there's the confusion with `ref_time` + `fcst_time` = `timestamp`: the PK works with `(ref_time, fcst_time)`, but `SELECT DISTINCT ON` works with `timestamp`, and consequently there is no index support. I suspect this is more convoluted than need be and there may be much faster solutions, but here is where I tap out ... – Erwin Brandstetter Nov 19 '22 at 17:30
  • This question is unclear. What does `status_table` have to do with the question? Are you looking for the shortest way to write the query or for good performance? – Laurenz Albe Nov 20 '22 at 11:39
  • @LaurenzAlbe I added the reference to `status_table` row count in response to a comment above. Some info from `status_table` is being pulled over into `data_table` by an `INNER JOIN`... it's hopefully all clear in the dbfiddle. I'm in the process right now of trying to figure out how to optimise indexes for this scenario (i.e. as per the dbfiddle), and I'm lost... it's my first time considering indexes so any insight you have would be useful... I can create a separate question if easier, because it's moving beyond my original question now... or can move this into a discussion. Thanks! – drmrbrewer Nov 20 '22 at 11:47
  • Yes, please keep the question simple and self-contained. Comments go away, so the question must be comprehensible without them. If you feel like these are actually two questions, starting a second one is a good idea. – Laurenz Albe Nov 20 '22 at 11:50
  • @LaurenzAlbe sure, I've posted a [new question here](https://stackoverflow.com/q/74507911/4070848)... would be great if you could take a look! – drmrbrewer Nov 20 '22 at 12:05

2 Answers


Answering the original question in this thread:

    ......
    DELETE FROM data_table
      WHERE
        (location, param_id, ref_time, fcst_time)
            NOT IN
        (SELECT location, param_id, ref_time, fcst_time FROM sel1
            UNION ALL
         SELECT location, param_id, ref_time, fcst_time FROM sel2);

Changing `WHERE (a) NOT IN (b) AND (a) NOT IN (c)` to `WHERE (a) NOT IN (b UNION c)` not only shortens the expression, but can also improve performance. Depending on how much overlap there is between `b` and `c`, you can pick either `UNION` or `UNION ALL`:

  • `WHERE (a) NOT IN (b UNION c)` removes duplicates between `b` and `c`, which adds a small cost up front but can speed things up later, when `a` is compared against the effectively smaller union (a sketch of this variant follows below).
  • `WHERE (a) NOT IN (b UNION ALL c)` skips deduplication, so if by design there can't be any duplicates, it's the better choice; that is the variant used above.
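
For completeness, here is what the deduplicating variant would look like (a sketch that reuses the statement above with `UNION` swapped in for `UNION ALL`; it only pays off when `sel1` and `sel2` can overlap):

    ......  -- preceding CTEs (sel1, sel2) unchanged
    DELETE FROM data_table
      WHERE
        (location, param_id, ref_time, fcst_time)
            NOT IN
        (SELECT location, param_id, ref_time, fcst_time FROM sel1
            UNION
         SELECT location, param_id, ref_time, fcst_time FROM sel2);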

Here's a demo showing how plans change. Locally, I also tested a delete from a 120M-row table against two 10M-row tables with 99% overlap and got 118 s for the initial version, 105 s for `union`, and 98 s for `union all`. With the same row counts and no overlap, it was 118 s, 103 s and 95 s. With more columns and more complex types, I'd expect the difference to be more pronounced.
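
If you want to run the same comparison against your own data without actually deleting anything, one way (a sketch; it assumes the `WITH ... sel1, sel2` definitions from the question are kept in front of the `DELETE`) is to wrap each variant in `EXPLAIN (ANALYZE, BUFFERS)` inside a transaction and roll it back:

    BEGIN;
    -- EXPLAIN ANALYZE actually executes the DELETE, so roll back afterwards
    EXPLAIN (ANALYZE, BUFFERS)
    ......  -- the WITH sel1/sel2 definitions from the question go here
    DELETE FROM data_table
      WHERE
        (location, param_id, ref_time, fcst_time)
            NOT IN
        (SELECT location, param_id, ref_time, fcst_time FROM sel1
            UNION ALL
         SELECT location, param_id, ref_time, fcst_time FROM sel2);
    ROLLBACK;  -- nothing is actually deleted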

I figured I'd make my comment into an answer, adding the tests that followed, leaving the broader optimisation to be discussed in the new thread.

Zegarek
  • Thanks... would be great if you could offer any thoughts you might have on the optimisation thread :) I'm stuck on that, but beginning to wonder whether `SELECT DISTINCT ON` should simply be avoided at all costs. – drmrbrewer Nov 22 '22 at 12:06
  • @drmrbrewer I'll comment here to avoid disturbing the ongoing discussion there. I'd partition your [`data_table`](https://www.postgresql.org/docs/current/ddl-partitioning.html), tweak [parallel worker settings](https://www.postgresql.org/docs/current/how-parallel-query-works.html) and maintain a cached whitelist if this query is something you need to run regularly - you could also run that delete from the same trigger that maintains the cached whitelist. Example of trigger-based caching and how partitioning helps build a parallel plan [here](https://stackoverflow.com/a/74323168/5298879). – Zegarek Nov 22 '22 at 12:40

How about another WITH around everything?

    WITH d AS (
      WITH
        stats AS (
          SELECT ref_time
            , max(updated) < (round(extract(epoch from now()) / 60) - 200) AS settled
            , (count(*) FILTER (WHERE processed) = count(*))
              AND (max(updated) < (round(extract(epoch from now()) / 60) - 200)) AS ready
          FROM status_table
          GROUP BY ref_time
        ),
        min_ts AS (
          SELECT ref_time FROM stats WHERE ready ORDER BY ref_time DESC LIMIT 1
        ),
        sel1 AS (
          -- records that would be selected by an actual data lookup (use same logic)... we need to keep these (don't delete)
          SELECT DISTINCT ON (d.location, d.timestamp, d.param_id)
            d.location, d.param_id, d.ref_time, d.fcst_time
          FROM data_table AS d
          INNER JOIN stats AS s USING (ref_time)
          WHERE s.ready AND d.timestamp >= (SELECT ref_time FROM min_ts)
          ORDER BY d.location, d.timestamp, d.param_id, d.ref_time DESC
        ),
        sel2 AS (
          -- also keep all records that are in-progress (not 'settled')
          SELECT
            d.location, d.param_id, d.ref_time, d.fcst_time
          FROM data_table AS d
          INNER JOIN stats AS s USING (ref_time)
          WHERE NOT s.settled
        )
      SELECT data_table.*
      FROM data_table
      LEFT JOIN sel1
        ON  sel1.location  = data_table.location
        AND sel1.param_id  = data_table.param_id
        AND sel1.ref_time  = data_table.ref_time
        AND sel1.fcst_time = data_table.fcst_time
      LEFT JOIN sel2
        ON  sel2.location  = data_table.location
        AND sel2.param_id  = data_table.param_id
        AND sel2.ref_time  = data_table.ref_time
        AND sel2.fcst_time = data_table.fcst_time
      WHERE sel1.location IS NULL
        AND sel2.location IS NULL
    )
    DELETE FROM data_table
    WHERE (location, param_id, ref_time, fcst_time) IN (SELECT location, param_id, ref_time, fcst_time FROM d);
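
Side note (not part of the original answer): once `d` is defined, the final lookup could also be expressed with `DELETE ... USING`, which joins against `d` directly; whether it performs differently from the `IN` form would need checking with `EXPLAIN`. A sketch, keeping the `WITH d AS (...)` part exactly as above:

    WITH d AS (
      ......  -- same subquery as above
    )
    DELETE FROM data_table AS t
    USING d
    WHERE (t.location, t.param_id, t.ref_time, t.fcst_time)
        = (d.location, d.param_id, d.ref_time, d.fcst_time);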
tmg