Currently I'm facing a problem where Postgres's query planner makes bad decisions based on what I think are bad estimates: in a larger query, the planner chooses the (hash) join from this post as its first/innermost step because the estimate for this join is only 274 rows, while joining these two tables actually produces 31770 rows. That leads to nested loops over those 31770 rows in the larger query, even though there are definitely cheaper paths that the planner would have considered if it had known the correct row count, i.e. if the estimate had been better.
This can be used to reproduce the problem:
CREATE TABLE b (id int primary key, name text, is_visible boolean);
INSERT INTO b (id, name, is_visible)
SELECT x, 'B #' || x, CASE WHEN x % 116 = 0 THEN true ELSE false END
FROM generate_series(1, 29499) AS x;
CREATE INDEX ON b (is_visible);
CREATE TABLE a (id bigserial primary key, b_id int references b(id), name text);
WITH dist AS (
SELECT '{7236,4431,3012,2339,2246,1907,1661,1356,1173,1029,533,505,415,354,336,275,188,168,168,153,152,133,126,125,113,90,73,72,65,64,64,48,46,35,34,31,26,26,26,25,25,25,24,22,22,21,20,20,19,19,15,15,15,15,13,13,12,12,12,12,12,11,11,11,11,8,8,8,8,8,8,8,8,7,7,7,6,6,6,6,6,6,6,6,6,6,5,5,5,5,5,5,5,5,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1}'::INT[] cnt
)
INSERT INTO a (b_id, name)
SELECT x.id, x.index || '/' || y
FROM (SELECT id, row_number() over (ORDER BY id) AS index FROM b WHERE is_visible = true) AS x,
generate_series(1, (SELECT cnt[x.index] FROM dist)) AS y
ORDER BY x.id;
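As a quick sanity check that the generated data has the intended size, the row counts can be compared against the numbers that show up in the plan further down (this is only to verify the reproduction, not part of the problem itself):
SELECT count(*) FROM a;                           -- expected: 31770
SELECT count(*) FROM b WHERE is_visible = true;   -- expected: 254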
I tried to reproduce the same data distribution as in my real tables, hence the strange array of counts. That actually seems to have worked, because when I execute the following query on the test data, the estimated and actual row counts are exactly the same as with my real data:
postgres=> explain analyze select * from a join b on b.id = a.b_id where b.is_visible=true;
                                                               QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=35.88..841.99 rows=274 width=31) (actual time=5.251..637.037 rows=31770 loops=1)
   Hash Cond: (a.b_id = b.id)
   ->  Seq Scan on a  (cost=0.00..722.70 rows=31770 width=18) (actual time=0.020..205.874 rows=31770 loops=1)
   ->  Hash  (cost=32.70..32.70 rows=254 width=13) (actual time=5.195..5.212 rows=254 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 20kB
         ->  Index Scan using b_is_visible_idx on b  (cost=0.29..32.70 rows=254 width=13) (actual time=0.044..2.728 rows=254 loops=1)
               Index Cond: (is_visible = true)
 Planning Time: 0.374 ms
 Execution Time: 836.821 ms
(9 rows)
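In case it helps, the column statistics the planner presumably bases the estimate on can be inspected like this (pg_stats is the standard statistics view; I'm not sure which of these values actually feed into the 274):
SELECT tablename, attname, null_frac, n_distinct
FROM pg_stats
WHERE (tablename, attname) IN (('a', 'b_id'), ('b', 'id'), ('b', 'is_visible'));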
What can I do to improve the estimate?
[Edit]
I may have asked the wrong question. I'm actually not interested in optimizing this particular query, but rather in understanding where the estimate of 274 comes from and how to get an estimate closer to the actual row count of 31770.
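My own guess, which may well be wrong: the 274 looks like the rows of a times the estimated 254 visible rows of b, divided by the total number of distinct values of b.id (29499):
SELECT round(31770 * 254 / 29499.0);   -- ≈ 274
If that is roughly what the planner does, I'd still like to understand how to give it better information here.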
[Edit 2]
Using Postgres 12.4