Fast tuple lookup in large indexed table (Postgres)

Question

Table Structure

I have a Postgres table T with 3 columns (w1, w2, occurrences), all three are integers and w1 and w2 are referencing to another table. So basically, there are two foreign keys and a value.
There are about 15 million unique values (indices) that w1 and w2 can become in each row. Currently, T holds around 1.8 billion rows, which are more or less permutations of the indices with the limitation, that - if you think of a symmetric matrix - only one value pair may exist. For example, there may be (252, 13, x) but not (13, 252, x). However, there is no sorting, so (5, 900, x) also may be in T (sorting is done during insert and depends on the value in the referenced table). These tuples (w1, w2) are unique. At the moment, there are 3 different indices on that table, a UNIQUE INDEX tdc_idx_1 on T (w1, w2), a INDEX tdc_idx_w1 on T (w1) and a INDEX tdc_idx_w2 on T (w2).

Problem Statement

Basically, I want to run two different queries. The difficulty for me is to figure out where the bad performance comes from, or to accept the runtime given the "complexity" of the query and the table size (I'll guess that is not the case..). Short, I could need help in the structure of the queries and probably the handling of the table indices (or maybe the table design in general).
The most straightforward query for my desired result is

-- A (the OR-query)
SELECT w1, w2, occurrences FROM public.T
WHERE (w1 in (123, 555, 999) OR w2 in (123, 555, 999) AND occurrences > 1;

and

-- B (the AND-query)
SELECT w1, w2, occurrences FROM public.T
WHERE (w1 in (123, 555, 999) AND w2 in (123, 555, 999) AND occurrences > 1;

respectively (note the OR/AND).

Now I will give some performance measures and some alternative queries I came up with using ANALYZE EXPLAIN.

Performance

Query A

-- A (OR)
EXPLAIN ANALYSE 
SELECT w1, w2, occurrences
FROM public.T
WHERE (w1 in (123, 555, 999) OR w2 in (123, 555, 999)) AND occurrences > 1

Successfully run. Total query runtime: 29 min 19 secs.
173071 rows affected. (Actual result w/o EXPLAIN ANALYSE)

Gather  (cost=3378.13..13984524.32 rows=50262 width=12) (actual time=144.749..1751127.665 rows=173071 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  ->  Parallel Bitmap Heap Scan on T  (cost=2378.13..13978498.12 rows=20942 width=12) (actual time=154.140..1750873.360 rows=57690 loops=3)
        Recheck Cond: ((w1 = ANY ('{123,555,999}'::integer[])) OR (w2 = ANY ('{123,555,999}'::integer[])))
        Rows Removed by Index Recheck: 20069348
        Filter: (occurences > 1)
        Rows Removed by Filter: 110878
        Heap Blocks: exact=902 lossy=107950
        ->  BitmapOr  (cost=2378.13..2378.13 rows=214069 width=0) (actual time=114.320..114.335 rows=0 loops=1)
              ->  Bitmap Index Scan on tdc_idx_w1  (cost=0.00..1631.03 rows=148439 width=0) (actual time=17.100..17.103 rows=166301 loops=1)
                    Index Cond: (w1 = ANY ('{123,555,999}'::integer[]))
              ->  Bitmap Index Scan on tdc_idx_w2  (cost=0.00..721.96 rows=65630 width=0) (actual time=97.208..97.211 rows=339406 loops=1)
                    Index Cond: (w2 = ANY ('{123,555,999}'::integer[]))
Planning Time: 0.280 ms
JIT:
  Functions: 12
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 2.689 ms, Inlining 117.210 ms, Optimization 36.646 ms, Emission 28.227 ms, Total 184.772 ms
Execution Time: 1751571.992 ms

Query B

-- B (AND)
EXPLAIN ANALYSE 
SELECT w1, w2, occurrences
FROM public.T
WHERE (w1 in (123, 555, 999) AND w2 in (123, 555, 999)) AND occurrences > 1

Successfully run. Total query runtime: 1 secs 716 msec.
3 rows affected. (Actual result w/o EXPLAIN ANALYSE)

Index Scan using tdc_idx_1 on T  (cost=0.58..61.13 rows=1 width=12) (actual time=88.895..239.394 rows=3 loops=1)
  Index Cond: ((w1 = ANY ('{123,555,999}'::integer[])) AND (w2 = ANY ('{123,555,999}'::integer[])))
  Filter: (occurences > 1)
Planning Time: 51.256 ms
Execution Time: 239.518 ms

Query A Alternative 1

-- A1 (OR)
EXPLAIN ANALYSE 
SELECT w1, w2, occurences
FROM public.token_doc_cooccurrences
INNER JOIN (
  VALUES (123), (555), (999)
) vals(v)
ON (w1 = v)
WHERE occurences > 1
UNION
SELECT w1, w2, occurences
FROM public.token_doc_cooccurrences
INNER JOIN (
  VALUES (123), (555), (999)
) vals2(w)
ON (w2 = w)
WHERE occurences > 1;

Successfully run. Total query runtime: 29 min 28 secs.
173071 rows affected. (Actual result w/o EXPLAIN ANALYSE)

HashAggregate  (cost=841679.67..842197.01 rows=51734 width=12) (actual time=1755066.279..1755315.668 rows=173071 loops=1)
  Group Key: T.w1, T.w2, T.occurences
  Batches: 5  Memory Usage: 4145kB  Disk Usage: 6920kB
  ->  Append  (cost=561.79..841291.66 rows=51734 width=12) (actual time=1088.188..1753999.477 rows=173074 loops=1)
        ->  Nested Loop  (cost=561.79..579176.31 rows=35898 width=12) (actual time=1088.181..75981.142 rows=55822 loops=1)
              ->  Values Scan on *VALUES*  (cost=0.00..0.04 rows=3 width=4) (actual time=0.006..0.032 rows=3 loops=1)
              ->  Bitmap Heap Scan on T  (cost=561.79..192939.10 rows=11966 width=12) (actual time=407.806..25231.892 rows=18607 loops=3)
                    Recheck Cond: (w1 = *VALUES*.column1)
                    Filter: (occurences > 1)
                    Rows Removed by Filter: 36826
                    Heap Blocks: exact=10630
                    ->  Bitmap Index Scan on tdc_idx_w1  (cost=0.00..558.80 rows=50963 width=0) (actual time=74.175..74.175 rows=55434 loops=3)
                          Index Cond: (w1 = *VALUES*.column1)
        ->  Nested Loop  (cost=250.51..261339.35 rows=15836 width=12) (actual time=217.234..1677133.291 rows=117252 loops=1)
              ->  Values Scan on *VALUES*_1  (cost=0.00..0.04 rows=3 width=4) (actual time=0.020..0.042 rows=3 loops=1)
              ->  Bitmap Heap Scan on T T_1  (cost=250.51..87060.31 rows=5279 width=12) (actual time=143.454..558815.229 rows=39084 loops=3)
                    Recheck Cond: (w2 = *VALUES*_1.column1)
                    Rows Removed by Index Recheck: 19099252
                    Filter: (occurences > 1)
                    Rows Removed by Filter: 74051
                    Heap Blocks: exact=18304 lossy=311451
                    ->  Bitmap Index Scan on tdc_idx_w2  (cost=0.00..249.19 rows=22482 width=0) (actual time=122.725..122.725 rows=113135 loops=3)
                          Index Cond: (w2 = *VALUES*_1.column1)
Planning Time: 68.676 ms
JIT:
  Functions: 21
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 5.749 ms, Inlining 219.374 ms, Optimization 370.011 ms, Emission 360.728 ms, Total 955.862 ms
Execution Time: 1756917.366 ms

Query B Alternative 1

-- B1 (AND)
EXPLAIN ANALYSE 
SELECT w1, w2, occurences
FROM public.token_doc_cooccurrences
INNER JOIN (
  VALUES (123), (555), (999)
) vals(v)
ON (w1 = v)
INNER JOIN (
  VALUES (123), (555), (999)
) vals2(w)
ON (w2 = w)
WHERE occurences > 1;

Successfully run. Total query runtime: 1 secs 5 msec.
3 rows affected. (Actual result w/o EXPLAIN ANALYSE)

Nested Loop  (cost=0.58..77.73 rows=1 width=12) (actual time=130.157..295.939 rows=3 loops=1)
  ->  Values Scan on ""*VALUES*_1""  (cost=0.00..0.04 rows=3 width=4) (actual time=0.005..0.019 rows=3 loops=1)
  ->  Nested Loop  (cost=0.58..25.87 rows=3 width=12) (actual time=54.355..98.622 rows=1 loops=3)
        ->  Values Scan on ""*VALUES*""  (cost=0.00..0.04 rows=3 width=4) (actual time=0.003..0.017 rows=3 loops=3)
        ->  Index Scan using tdc_idx_1 on T  (cost=0.58..8.60 rows=1 width=12) (actual time=32.853..32.855 rows=0 loops=9)
              Index Cond: ((w1 = ""*VALUES*"".column1) AND (w2 = ""*VALUES*_1"".column1))
              Filter: (occurences > 1)
Planning Time: 59.447 ms
Execution Time: 296.042 ms

Info
Between all measurements, I did this to clear cached queries:

systemctl stop postgresql
sync
echo 3 > /proc/sys/vm/drop_caches
systemctl start postgresql

== Edit ==

As @jjanes suggested, I added additional indices (w1, occurrences, w2) and (w2, occurrences, w1) and ran VACUUM.
Additionally, I set work_mem = 16MB. Afterwards I cleared the caches again and ran Query A Alternative 1 again. This is the result:

HashAggregate  (cost=3535.61..4052.95 rows=51734 width=12) (actual time=3240.753..3471.160 rows=173071 loops=1)
  Group Key: token_doc_cooccurrences.w1, token_doc_cooccurrences.w2, token_doc_cooccurrences.occurences
  Batches: 1  Memory Usage: 14353kB
  Buffers: shared hit=152549 read=993
  ->  Append  (cost=0.58..3147.60 rows=51734 width=12) (actual time=196.966..2927.069 rows=173074 loops=1)
        Buffers: shared hit=152549 read=993
        ->  Nested Loop  (cost=0.58..1642.71 rows=35898 width=12) (actual time=196.960..1948.256 rows=55822 loops=1)
              Buffers: shared hit=46960 read=533
              ->  Values Scan on ""*VALUES*""  (cost=0.00..0.04 rows=3 width=4) (actual time=0.007..0.024 rows=3 loops=1)
              ->  Index Only Scan using w1_occ_w2 on token_doc_cooccurrences  (cost=0.58..427.90 rows=11966 width=12) (actual time=85.026..603.453 rows=18607 loops=3)
                    Index Cond: ((w1 = ""*VALUES*"".column1) AND (occurences > 1))
                    Heap Fetches: 0
                    Buffers: shared hit=46960 read=533
        ->  Nested Loop  (cost=0.58..728.88 rows=15836 width=12) (actual time=40.764..573.493 rows=117252 loops=1)
              Buffers: shared hit=105589 read=460
              ->  Values Scan on ""*VALUES*_1""  (cost=0.00..0.04 rows=3 width=4) (actual time=0.008..0.027 rows=3 loops=1)
              ->  Index Only Scan using w2_occ_w1 on token_doc_cooccurrences token_doc_cooccurrences_1  (cost=0.58..190.16 rows=5279 width=12) (actual time=30.809..99.922 rows=39084 loops=3)
                    Index Cond: ((w2 = ""*VALUES*_1"".column1) AND (occurences > 1))
                    Heap Fetches: 0
                    Buffers: shared hit=105589 read=460
                    Planning:
  Buffers: shared hit=121 read=13
  Planning Time: 165.009 ms
  Execution Time: 3672.053 ms

Personally, I cannot tell what effect the setting or the index-only-scan separately have, but the result is great!

Successfully run. Total query runtime: 6 secs 122 msec.
173071 rows affected.

~6 seconds compared to almost 30 minutes.

score 2 · Accepted Answer · answered May 26 '21 at 19:28

   Rows Removed by Index Recheck: 20069348
   Heap Blocks: exact=902 lossy=107950

Your work_mem is too low for a query of this nature. That causes the bitmap to "go lossy", meaning it remembers only which blocks hold rows of interest, and not which rows in the block. That is why it has to recheck all the rows in each of the 107950 blocks, even thought it "knew" at one time that most of them didn't qualify. So increase work_mem. But I bet most of the time is in doing the IO to read in those blocks, not the CPU time of doing the rechecks. So while increasing work_mem is a good idea, it probably won't be magic. You can try to improve the IO by increasing effective_io_concurrency so that you can have many IO requests outstanding at once. How effective this will depend on your storage system and drivers.

You should set track_io_timing = on, and do your plans with EXPLAIN (ANALYZE, BUFFERS). That way we wouldn't have to guess that the time is spent on IO, you would actually get measurements.

For your alternative formulation of query A, that should be greatly improved by building indexes on both (w1, occurrences, w2) and on (w2, occurrences, w1). That way you can get index-only scans, which would generate a lot less random IO, and would also benefit from using the index to satisfy the restriction of occurrences > 1. If you put "occurrences" last rather than as the middle column, the resulting indexes might be more general for other queries as well, while still qualifying for index-only-scans, but would not be able to filter for occurrences > 1 as efficiently. But you do need your table to be well-vacuumed for index-only scans to be effective.

A completely different approach would be to buy better storage. You currently seem to have rotational drives, and rather slow ones at that. Switching to solid-state drives should give you a huge improvement.

Thank you a lot! I edited my question to show `EXPLAIN (ANALYZE, BUFFERS)`. It seems like the index-only scan makes the difference and the work_mem setting doesn't have a huge impact. — Daniello, May 27 '21 at 16:38