Summary
I am joining a large table (~600K rows) with a smaller one (~11K rows) on a PostgreSQL database and I need to filter the result set by a descriptive text
field.
When filtering by a bigint
field of the smaller table the optimizer estimates the resulting number of rows correctly, but when filtering by a text
field of the small table the optimizer underestimates the resulting number of rows by thousands of times, even though there's a 1-1 relation between the two.
I fail to understand this behavior.
The whole environment, including the data, can be setup with the instructions found in this pastebin. It is too large for conventional online mock DBs.
Environment
select version();
|version |
|---------------------------------------------------------------------------------------------------|
|PostgreSQL 13.6 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit|
Sitting on Azure Flexible Server.
Structures and data
Parent table (11228 rows):
create table parent_tb as
select id, md5(random()::text) descr
from generate_series(1::bigint,11228::bigint) as a(id);
alter table parent_tb add primary key (id);
create index idx_parent_tb_desc on parent_tb(descr);
Sample data:
select *
from parent_tb
limit 3;
|id |descr |
|---|--------------------------------|
|1 |a34ea09794a959c333116a83b3b77700|
|2 |6c40f8248d28dc541724e86d17f96775|
|3 |99f218398cd5125c803edf7eccc9c832|
Child table (597057 rows):
create table child_tb (
id_tenant bigint,
id bigint,
descr text,
id_ref bigint references parent_tb(id) null,
primary key (id_tenant, id)
) partition by list(id_tenant);
The child_tb
table is partitioned for every id_tenant
value from 1
to 35
:
create table child_tb_ten_1 partition of child_tb for values in ('1');
-- ...
create table child_tb_ten_1 partition of child_tb for values in ('35');
Sample data:
select *
from child_tb
limit 3;
|id_tenant|id |descr |id_ref|
|---------|---|--------------------------------|------|
|1 |1 |638d2b2aa799d4871e0e2fa73ae607de|1 |
|1 |2 |f06668f8df7eaee3c539d2a1ba613604|1 |
|1 |3 |f588557239d37ec301bae79ab9a61742|1 |
Collect data for test case
Collect statistics:
set default_statistics_target=10000;
vacuum analyze parent_tb;
vacuum analyze child_tb;
Pick a large id_tenant
:
select count(1) n_keys, id_tenant, count(distinct id_ref) n_le, array_agg(distinct (id_ref,p.descr))
from child_tb c join parent_tb p on (c.id_ref=p.id)
group by id_tenant
order by 1 desc
limit 3;
|n_keys|id_tenant|n_le|array_agg |
|------|---------|----|------------------------------------------|
|475759|6 |1 |{"(53,6ea8c6d951f3c8371662509ff8a5e37e)"} |
|18352 |14 |1 |{"(4,a0b360d0344c7018aa98d511392aa26f)"} |
|17102 |2 |1 |{"(17,43595b7e1b092d120bbc8a94afca9583)"} |
Tenant 6
is clearly an outlier at a global level.
The statistics recognize that the first Most Common Values and Most Common Frequencies are related to tenant 6
which retains the ~80% of the data:
select
(most_common_vals::text::numeric[])[1] mcv_1, (most_common_freqs::text::numeric[])[1] mcf_1,
(most_common_vals::text::numeric[])[2] mcv_2, (most_common_freqs::text::numeric[])[2] mcf_2,
*
from pg_stats where tablename in ('child_tb','child_tb_ten_6');
|mcv_1|mcf_1 |mcv_2|mcf_2 |schemaname|tablename |attname |
|-----|---------|-----|-----------|----------|--------------|---------|
|6 |0.7968402|14 |0.030737434|argodb |child_tb |id_tenant|
| | | | |argodb |child_tb |id |
| | | | |argodb |child_tb |descr |
|53 |0.7968402|4 |0.030737434|argodb |child_tb |id_ref |
|6 |1 | | |argodb |child_tb_ten_6|id_tenant|
| | | | |argodb |child_tb_ten_6|id |
| | | | |argodb |child_tb_ten_6|descr |
|53 |1 | | |argodb |child_tb_ten_6|id_ref |
Get the data from tenant 6
which will be used to filter the final query:
select distinct c.id_ref,p.descr
from child_tb c join parent_tb p on (c.id_ref=p.id)
where id_tenant=6;
|id_ref|descr |
|------|--------------------------------|
|53 |6ea8c6d951f3c8371662509ff8a5e37e|
Behavior
Please note: every execution plan shown here is the result of an explain analyze
and therefore represents the actual case.
First attempt — Access by Parent's ID.
Result OK — The optimizer correctly recognizes the number of rows resulting from the JOIN.
explain analyze
select * from child_tb c join parent_tb p on (c.id_ref=p.id)
where p.id=53 --fill with the appropriate value at point (1)
and c.id_tenant=6;
|QUERY PLAN |
|---------------------------------------------------------------------------------------------------------------------------------|
|Nested Loop (cost=0.29..17305.28 rows=475759 width=98) (actual time=0.025..88.812 rows=475759 loops=1) |
| -> Index Scan using parent_tb_pkey on parent_tb p (cost=0.29..4.30 rows=1 width=41) (actual time=0.012..0.014 rows=1 loops=1)|
| Index Cond: (id = 53) |
| -> Seq Scan on child_tb_ten_6 c (cost=0.00..12543.38 rows=475759 width=57) (actual time=0.011..52.590 rows=475759 loops=1) |
| Filter: ((id_ref = 53) AND (id_tenant = 6)) |
|Planning Time: 0.183 ms |
|Execution Time: 103.176 ms |
Second attempt — Access by Parent's descriptive field.
Result KO — The optimizer underestimates the resulting records out of the join by 1000x!
explain analyze
select * from child_tb c join parent_tb p on (c.id_ref=p.id)
where p.descr='6ea8c6d951f3c8371662509ff8a5e37e' --fill with the appropriate value at point (1)
and c.id_tenant=6;
|QUERY PLAN |
|-------------------------------------------------------------------------------------------------------------------------------------------------|
|Gather (cost=1004.32..9413.99 rows=42 width=98) (actual time=0.409..119.637 rows=475759 loops=1) |
| Workers Planned: 2 |
| Workers Launched: 2 |
| -> Hash Join (cost=4.32..8409.79 rows=18 width=98) (actual time=0.074..44.309 rows=158586 loops=3) |
| Hash Cond: (c.id_ref = p.id) |
| -> Parallel Seq Scan on child_tb_ten_6 c (cost=0.00..7884.91 rows=198233 width=57) (actual time=0.007..16.501 rows=158586 loops=3) |
| Filter: (id_tenant = 6) |
| -> Hash (cost=4.30..4.30 rows=1 width=41) (actual time=0.020..0.021 rows=1 loops=3) |
| Buckets: 1024 Batches: 1 Memory Usage: 9kB |
| -> Index Scan using idx_parent_tb_desc on parent_tb p (cost=0.29..4.30 rows=1 width=41) (actual time=0.016..0.017 rows=1 loops=3)|
| Index Cond: (descr = '6ea8c6d951f3c8371662509ff8a5e37e'::text) |
|Planning Time: 0.620 ms |
|Execution Time: 134.484 ms |
Additional attempt — Obfuscate the Parent's accessing value with a materialized CTE.
The optimizer assumes a flat data distribution dividing the total number of rows of the Tenant 6
partition to the the global number of distinct id_ref
s. This behavior is expected.
select
(select count(1) from child_tb where id_tenant=6) cnt_child_6,
(select count(distinct id_ref) from child_tb) cnt_child_dist_refs,
(select count(1) from child_tb where id_tenant=6)/(select count(distinct id_ref) from child_tb) cnt_flat_6_distr
|cnt_child_6|cnt_child_dist_refs|cnt_flat_6_distr|
|-----------|-------------------|----------------|
|475759 |61 |7799 |
Result OK — The optimizer correctly assumes the number of rows resulting from the JOIN.
explain analyze
with par as materialized (
select *
from parent_tb
where descr='6ea8c6d951f3c8371662509ff8a5e37e' --fill with the appropriate value at point (1)
)
select * from child_tb c join par p on (c.id_ref=p.id)
where c.id_tenant=6;
|QUERY PLAN |
|-------------------------------------------------------------------------------------------------------------------------------------|
|Hash Join (cost=4.33..13220.41 rows=7799 width=97) (actual time=0.031..124.799 rows=475759 loops=1) |
| Hash Cond: (c.id_ref = p.id) |
| CTE par |
| -> Index Scan using idx_parent_tb_desc on parent_tb (cost=0.29..4.30 rows=1 width=41) (actual time=0.014..0.015 rows=1 loops=1)|
| Index Cond: (descr = '6ea8c6d951f3c8371662509ff8a5e37e'::text) |
| -> Seq Scan on child_tb_ten_6 c (cost=0.00..11353.99 rows=475759 width=57) (actual time=0.009..45.301 rows=475759 loops=1) |
| Filter: (id_tenant = 6) |
| -> Hash (cost=0.02..0.02 rows=1 width=40) (actual time=0.017..0.018 rows=1 loops=1) |
| Buckets: 1024 Batches: 1 Memory Usage: 9kB |
| -> CTE Scan on par p (cost=0.00..0.02 rows=1 width=40) (actual time=0.015..0.015 rows=1 loops=1) |
|Planning Time: 0.150 ms |
|Execution Time: 138.972 ms |
Conclusion
I fail to understand the 42
estimated rows by the optimizer in the Second attempt. I'd expect it to realize the 1-1 relation between the UK and the PK and use that one, like in the First attempt. Moreover, the 42
value looks arbitrary to me since I can't figure out from where it originates, unlike the 7799
estimate from the Additional attempt.