Optimize setup for read performance
Your memory settings seem reasonable for a 64GB server - except maybe work_mem = 512MB. That's high. Your queries are not particularly complex and your tables are not that big.
4.5 million rows (300k x 15) in the simple junction table documents_tags should occupy ~ 156 MB and the PK another 96 MB. For your query you typically don't need to read the whole table, just small parts of the index. For a "mostly read-only" workload like yours, you should see index-only scans on the index of the PK exclusively. You don't need nearly as much work_mem - which may not matter much - except where you have many concurrent queries. Quoting the manual:
... several running sessions could be doing such operations concurrently. Therefore, the total memory used could be many times the value of work_mem; it is necessary to keep this fact in mind when choosing the value.
Setting work_mem too high may actually impair performance.
I suggest reducing work_mem to 128 MB or less to avoid possible memory starvation - unless you have other common queries that require more. You can always set it higher locally for special queries.
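For example, a transaction-local override for an exceptionally memory-hungry query might look like this - the query and the 1GB value are just illustrations, not recommendations for your workload:
BEGIN;
SET LOCAL work_mem = '1GB';  -- reverts automatically at COMMIT / ROLLBACK
-- hypothetical heavy aggregate that benefits from more memory for hashing
SELECT dt.tag_id, count(*) AS documents
FROM documents_tags dt
GROUP BY dt.tag_id
ORDER BY documents DESC;
COMMIT;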
There are several other angles to optimize read performance.
Key problem: leading index column
All of this may help a little. But the key problem is this:
PRIMARY KEY (document_id, tag_id)
300k documents, 2 tags to exclude. Ideally, you have an index with tag_id as leading column and document_id as 2nd column. With an index on just (tag_id) you can't get index-only scans. If this query is your only use case, change your PK as demonstrated below.
Or probably even better: create an additional plain index on (tag_id, document_id) if you need both - and drop the two other indexes on documents_tags on just (tag_id) and (document_id). They offer nothing over the two multicolumn indexes. The remaining 2 indexes (as opposed to 3 indexes before) are smaller and superior in every way. A sketch of that alternative follows after the maintenance commands below.
While you are at it, I suggest also running CLUSTER on the table using the new PK, all in one transaction, possibly with some extra maintenance_work_mem set locally:
BEGIN;
SET LOCAL maintenance_work_mem = '256MB';
ALTER TABLE documents_tags
DROP CONSTRAINT documents_tags_pkey
, ADD PRIMARY KEY (tag_id, document_id); -- tag_id first.
CLUSTER documents_tags USING documents_tags_pkey;
COMMIT;
Don't forget to:
VACUUM ANALYZE documents_tags;
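If you go the alternative route instead - keep the current PK and add a plain multicolumn index - a sketch might look like the following. The index names are assumptions; replace them with the actual names of your existing single-column indexes (check with \d documents_tags in psql):
CREATE INDEX documents_tags_tag_document_idx ON documents_tags (tag_id, document_id);
-- The single-column indexes offer nothing over the two multicolumn indexes.
-- Index names below are assumed, use your actual names:
DROP INDEX documents_tags_tag_id_idx;
DROP INDEX documents_tags_document_id_idx;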
Queries
The query itself is run-of-the-mill. Here are the 4 standard techniques:
NOT IN is - to quote myself:
Only good for small sets without NULL values
Your use case exactly: all involved columns NOT NULL and your list of excluded items is very short. Your original query is a hot contender.
NOT EXISTS and LEFT JOIN / IS NULL are always hot contenders. Both have been suggested in other answers. LEFT JOIN has to be an actual LEFT [OUTER] JOIN, though.
EXCEPT ALL would be shortest, but often not as fast.
1. NOT IN
SELECT document_id
FROM documents d
WHERE document_id NOT IN (
    SELECT document_id -- no need for column alias, only value is relevant
    FROM documents_tags
    WHERE tag_id IN (12345, 7654)
);
2. NOT EXISTS
SELECT document_id
FROM documents d
WHERE NOT EXISTS (
    SELECT FROM documents_tags
    WHERE document_id = d.document_id
    AND tag_id IN (12345, 7654)
);
3. LEFT JOIN / IS NULL
SELECT d.document_id
FROM documents d
LEFT JOIN documents_tags dt ON dt.document_id = d.document_id
    AND dt.tag_id IN (12345, 7654)
WHERE dt.document_id IS NULL;
4. EXCEPT ALL
SELECT document_id
FROM documents
EXCEPT ALL -- ALL keeps duplicate rows and makes it faster
SELECT document_id
FROM documents_tags
WHERE tag_id IN (12345, 7654);
Benchmark
To put my theories to the test.
Test setup
SET random_page_cost = 1.1;
SET work_mem = '128MB';
CREATE TABLE documents (
  document_id serial PRIMARY KEY
, title text
);
-- CREATE TABLE tags ( ... -- irrelevant for this query
CREATE TABLE documents_tags (
  document_id int REFERENCES documents
, tag_id int -- REFERENCES tags -- irrelevant for test
  -- no PK yet, to test seq scan
  -- it's also faster to create the PK after filling the big table
);
INSERT INTO documents (title)
SELECT 'some dummy title ' || g
FROM generate_series(1, 300000) g;
INSERT INTO documents_tags(document_id, tag_id)
SELECT i.*
FROM documents d
CROSS JOIN LATERAL (
    SELECT DISTINCT d.document_id, ceil(random() * 50000)::int
    FROM generate_series(1, 30)
    WHERE random() > .5
) i;
ALTER TABLE documents_tags ADD PRIMARY KEY (document_id, tag_id); -- current idx
VACUUM ANALYZE documents_tags;
VACUUM ANALYZE documents;
Note that rows in documents_tags are physically clustered by document_id due to the way I filled the table - which is likely your current situation as well.
Original Test with Postgres 9.5 on 2016-08-06
On my old laptop with 4 GB RAM.
3 test runs with each of the 4 queries, best of 5 every time to exclude caching effects.
Test 1: With documents_tags_pkey like you have it. Index and physical order of rows are bad for our query.
Test 2: Recreate the PK on (tag_id, document_id) as suggested.
Test 3: CLUSTER on the new PK.
Execution time of EXPLAIN ANALYZE in ms:
time in ms | Test 1 | Test 2 | Test 3
---------------+--------+--------+-------
1. NOT IN | 654 | 70 | 71 -- winner!
2. NOT EXISTS | 684 | 103 | 97
3. LEFT JOIN | 685 | 98 | 99
4. EXCEPT ALL | 833 | 255 | 250
Conclusions
The key element is the right index with leading tag_id - for queries involving few tag_id and many document_id.
To be precise, it's not important that there are more distinct document_id than tag_id. This could be the other way round as well. Btree indexes basically perform the same with any order of columns. What matters is that the most selective predicate in your query filters on tag_id - and that's faster on the leading index column(s).
The winning query for few tag_id to exclude is your original with NOT IN.
NOT EXISTS and LEFT JOIN / IS NULL result in the same query plan. For more than a couple of dozen IDs, I expect these to scale better ...
In a read-only situation you'll see index-only scans exclusively, so the physical order of rows in the table becomes irrelevant. Hence, test 3 did not bring any more improvements.
If writes to the table happen and autovacuum can't keep up, you'll see (bitmap) index scans. Physical clustering is important for those.
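To verify which plan you actually get in your own setup, inspect the plan like this (the tag IDs are the placeholders from the examples above). Look for "Index Only Scan" nodes and low "Heap Fetches" in the output:
EXPLAIN (ANALYZE, BUFFERS)
SELECT document_id
FROM documents d
WHERE NOT EXISTS (
    SELECT FROM documents_tags
    WHERE document_id = d.document_id
    AND tag_id IN (12345, 7654)
);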
Retest with Postgres 15 on 2023-04-11
With plenty of work_mem on a newer laptop with 64 GB RAM.
I skipped "Test 2", and added "Test 1+" instead. That's the same as Test 1, but with increased max_parallel_workers_per_gather
so that 3 parallel workers were used instead of just 2.
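To try that yourself, raising the setting for the session is enough. The exact value used in the test is not given above, so the 3 here is an assumption (the default is 2):
SET max_parallel_workers_per_gather = 3;  -- allow up to 3 parallel workers per Gather node
-- re-run the EXPLAIN ANALYZE of the queries above, then restore the default:
RESET max_parallel_workers_per_gather;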
time in ms | Test 1 | Test 1+ | Test 3
---------------+--------+---------+-------
1. NOT IN | 170 | 154 | 42
2. NOT EXISTS | 153 | 132 | 35 -- winner!
3. LEFT JOIN | 154 | 135 | 35 -- same query plan
4. EXCEPT ALL | 276 | 254 | 137
As expected, performance generally improved.
NOT IN could not keep up, since it poses some inherent optimization barriers. The clear winner is now NOT EXISTS or LEFT JOIN / IS NULL, using all index-only scans.
"Test 1" and "Test 1+" look better than they were. Those used 2 and 3 workers, so multiply the numbers by that factor go get actual system load.