
I am having problems optimizing a query in PostgreSQL 9.5.14.

select *
from file as f
join product_collection pc on (f.product_collection_id = pc.id)
where pc.mission_id = 7
order by f.id asc
limit 100;

This takes about 100 seconds. If I drop the LIMIT clause it takes about 0.5 seconds:

With limit:

explain (analyze,buffers) ... -- query exactly as above
                                                                             QUERY PLAN                                                                             
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.84..859.32 rows=100 width=457) (actual time=102793.422..102856.884 rows=100 loops=1)
   Buffers: shared hit=222430592
   ->  Nested Loop  (cost=0.84..58412343.43 rows=6804163 width=457) (actual time=102793.417..102856.872 rows=100 loops=1)
         Buffers: shared hit=222430592
         ->  Index Scan using file_pkey on file f  (cost=0.57..23409008.61 rows=113831736 width=330) (actual time=0.048..28207.152 rows=55858772 loops=1)
               Buffers: shared hit=55652672
         ->  Index Scan using product_collection_pkey on product_collection pc  (cost=0.28..0.30 rows=1 width=127) (actual time=0.001..0.001 rows=0 loops=55858772)
               Index Cond: (id = f.product_collection_id)
               Filter: (mission_id = 7)
               Rows Removed by Filter: 1
               Buffers: shared hit=166777920
 Planning time: 0.803 ms
 Execution time: 102856.988 ms

Without limit:

=>  explain (analyze,buffers)  ... -- query as above, just without limit
                                                                                 QUERY PLAN                                                                                 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=20509671.01..20526681.42 rows=6804163 width=457) (actual time=456.175..510.596 rows=142055 loops=1)
   Sort Key: f.id
   Sort Method: quicksort  Memory: 79392kB
   Buffers: shared hit=37956
   ->  Nested Loop  (cost=0.84..16494851.02 rows=6804163 width=457) (actual time=0.044..231.051 rows=142055 loops=1)
         Buffers: shared hit=37956
         ->  Index Scan using product_collection_mission_id_index on product_collection pc  (cost=0.28..46.13 rows=87 width=127) (actual time=0.017..0.101 rows=87 loops=1)
               Index Cond: (mission_id = 7)
               Buffers: shared hit=10
         ->  Index Scan using file_product_collection_id_index on file f  (cost=0.57..187900.11 rows=169535 width=330) (actual time=0.007..1.335 rows=1633 loops=87)
               Index Cond: (product_collection_id = pc.id)
               Buffers: shared hit=37946
 Planning time: 0.807 ms
 Execution time: 569.865 ms

I have copied the database to a backup server so that I may safely manipulate the database without something else changing it on me.

Cardinalities:

  • Table file: 113,831,736 rows.
  • Table product_collection: 1,370 rows.
  • The query without LIMIT: 142,055 rows.
  • SELECT count(*) FROM product_collection WHERE mission_id = 7: 87 rows.
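
For context, these cardinalities correspond roughly to counts like the following (a sketch, not the exact statements run):

-- Hypothetical count queries behind the cardinalities above
SELECT count(*) FROM file;                                     -- ~113,831,736
SELECT count(*) FROM product_collection;                       -- ~1,370
SELECT count(*) FROM product_collection WHERE mission_id = 7;  -- 87
SELECT count(*)
FROM   file f
JOIN   product_collection pc ON f.product_collection_id = pc.id
WHERE  pc.mission_id = 7;                                      -- 142,055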

What I have tried:

  • searching stack overflow
  • vacuum full analyze
  • creating two-column indexes on file.product_collection_id & file.id (there already are single-column indexes on every field touched).
  • creating two-column indexes on file.id & file.product_collection_id.
  • increasing the statistics on file.id & file.product_collection_id, then re-vacuum analyze (roughly as sketched after this list).
  • changing various query planner settings.
  • creating non-materialized views.
  • walking up and down the hallway while muttering to myself.
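
Roughly what the index and statistics attempts looked like (a sketch; the index names and statistics targets here are illustrative, not the exact commands used):

-- Two-column indexes in both column orders (names are hypothetical)
CREATE INDEX file_pc_id_id_idx ON file (product_collection_id, id);
CREATE INDEX file_id_pc_id_idx ON file (id, product_collection_id);

-- Raise the statistics targets, then re-analyze
ALTER TABLE file ALTER COLUMN id SET STATISTICS 1000;
ALTER TABLE file ALTER COLUMN product_collection_id SET STATISTICS 1000;
VACUUM ANALYZE file;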

None of them seem to change the performance in a significant way.

Thoughts?

UPDATE from OP:
I tested this on PostgreSQL 9.6 & 10.4 and found no significant changes in plans or performance.

However, setting random_page_cost low enough is the only way I found to get faster performance for the query without LIMIT.

With the default random_page_cost = 4, the query without LIMIT gives this plan:

                                                                           QUERY PLAN                                                                           
----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=9270013.01..9287875.64 rows=7145054 width=457) (actual time=47782.523..47843.812 rows=145697 loops=1)
   Sort Key: f.id
   Sort Method: external sort  Disk: 59416kB
   Buffers: shared hit=3997185 read=1295264, temp read=7427 written=7427
   ->  Hash Join  (cost=24.19..6966882.72 rows=7145054 width=457) (actual time=1.323..47458.767 rows=145697 loops=1)
         Hash Cond: (f.product_collection_id = pc.id)
         Buffers: shared hit=3997182 read=1295264
         ->  Seq Scan on file f  (cost=0.00..6458232.17 rows=116580217 width=330) (actual time=0.007..17097.581 rows=116729984 loops=1)
               Buffers: shared hit=3997169 read=1295261
         ->  Hash  (cost=23.08..23.08 rows=89 width=127) (actual time=0.840..0.840 rows=87 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 15kB
               Buffers: shared hit=13 read=3
               ->  Bitmap Heap Scan on product_collection pc  (cost=4.97..23.08 rows=89 width=127) (actual time=0.722..0.801 rows=87 loops=1)
                     Recheck Cond: (mission_id = 7)
                     Heap Blocks: exact=10
                     Buffers: shared hit=13 read=3
                     ->  Bitmap Index Scan on product_collection_mission_id_index  (cost=0.00..4.95 rows=89 width=0) (actual time=0.707..0.707 rows=87 loops=1)
                           Index Cond: (mission_id = 7)
                           Buffers: shared hit=3 read=3
 Planning time: 0.929 ms
 Execution time: 47911.689 ms
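
For reference, lowering the setting for a single session looks like this (a sketch; 1.1 is just an illustrative value for SSD-backed storage, not the exact setting that tipped the planner):

-- Lower the planner's random I/O cost estimate for this session only
SET random_page_cost = 1.1;

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   file f
JOIN   product_collection pc ON f.product_collection_id = pc.id
WHERE  pc.mission_id = 7
ORDER  BY f.id;

RESET random_page_cost;  -- back to the default of 4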

User Erwin's answer below will take me some time to fully understand and generalize to all of the use cases needed. In the meantime we will probably use either a materialized view or just flatten our table structure.
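
For illustration, a minimal sketch of what such a materialized view might look like (the name, column set, and refresh approach are placeholders, not a final design):

-- Hypothetical flattened materialized view; needs periodic refreshes to stay current
CREATE MATERIALIZED VIEW file_with_mission AS
SELECT f.*, pc.mission_id
FROM   file f
JOIN   product_collection pc ON f.product_collection_id = pc.id;

CREATE INDEX ON file_with_mission (mission_id, id);

-- The original query then becomes:
-- SELECT * FROM file_with_mission WHERE mission_id = 7 ORDER BY id LIMIT 100;

-- Refresh when the underlying tables change:
-- REFRESH MATERIALIZED VIEW file_with_mission;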

user2437173
  • Just out of curiosity can you try setting `set join_collapse_limit to 1` for limit query? – Slava Lenskyy Dec 07 '18 at 15:43
  • And can you also make the two queries identical? the select list is different – Slava Lenskyy Dec 07 '18 at 15:47
  • No changes to query plan with either. – user2437173 Dec 07 '18 at 15:52
  • Skoffer, ... D'oh! So many things tried, copied and pasted the wrong explain block. I will update as soon as the query is done, but it was the same. – user2437173 Dec 07 '18 at 15:55
  • Can you change the order of the tables as well? like that `join product_collection pc join file` and then apply `join_collapse_limit to 1` – Slava Lenskyy Dec 07 '18 at 15:55
  • I used `select * from product_collection pc join file f on (pc.id = f.product_collection_id) where pc.mission_id =7 order by f.id asc limit 100;` and got the exact same query plan as above, with and without join collapse limit =1. Trivial differences in actual time, same cost and buffers and plan. – user2437173 Dec 07 '18 at 16:04
  • Please consider instructions in the [tag-info of \[postgresql-performance\]](https://stackoverflow.com/tags/postgresql-performance/info). Most importantly: `CREATE TABLE` scripts and exact index definitions. Is it always `pc.mission_id = 7`? Which parts of your query are changeable? Also, are you bound to the outdated pg 9.5? – Erwin Brandstetter Dec 07 '18 at 17:47
  • And do you need `SELECT *`? And `LIMIT 100`? Do you *need* `order by f.id`? Be precise about your requirements. – Erwin Brandstetter Dec 07 '18 at 18:07

1 Answer


This query is harder for the Postgres query planner than it might look. Depending on cardinalities, data distribution, value frequencies, sizes, ... completely different query plans can prevail and the planner has a hard time predicting which is best. Current versions of Postgres are better at this in several aspects, but it's still hard to optimize.

Since you retrieve only relatively few rows from product_collection, this equivalent query with LIMIT in a LATERAL subquery should avoid performance degradation:

SELECT *
FROM   product_collection pc
CROSS  JOIN LATERAL (
   SELECT *
   FROM   file f  -- big table
   WHERE  f.product_collection_id = pc.id
   ORDER  BY f.id
   LIMIT  100
   ) f
WHERE  pc.mission_id = 7
ORDER  BY f.id
LIMIT  100;

Edit: This results in the following query plan (explain (analyze, verbose) output provided by the OP):

                                                                         QUERY PLAN                                                                          
-------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=30524.34..30524.59 rows=100 width=457) (actual time=13.128..13.167 rows=100 loops=1)
   Buffers: shared hit=3213
   ->  Sort  (cost=30524.34..30546.09 rows=8700 width=457) (actual time=13.126..13.152 rows=100 loops=1)
         Sort Key: file.id
         Sort Method: top-N heapsort  Memory: 76kB
         Buffers: shared hit=3213
         ->  Nested Loop  (cost=0.57..30191.83 rows=8700 width=457) (actual time=0.060..9.868 rows=2880 loops=1)
               Buffers: shared hit=3213
               ->  Seq Scan on product_collection pc  (cost=0.00..69.12 rows=87 width=127) (actual time=0.024..0.336 rows=87 loops=1)
                     Filter: (mission_id = 7)
                     Rows Removed by Filter: 1283
                     Buffers: shared hit=13
               ->  Limit  (cost=0.57..344.24 rows=100 width=330) (actual time=0.008..0.071 rows=33 loops=87)
                     Buffers: shared hit=3200
                         ->  Index Scan using file_pc_id_index on file  (cost=0.57..582642.42 rows=169535 width=330) (actual time=0.007..0.065 rows=33 loops=87)
                           Index Cond: (product_collection_id = pc.id)
                           Buffers: shared hit=3200
 Planning time: 0.595 ms
 Execution time: 13.319 ms

You need these indexes (will help your original query, too):

CREATE INDEX idx1 ON file (product_collection_id, id);     -- crucial
CREATE INDEX idx2 ON product_collection (mission_id, id);  -- helpful

You mentioned:

two column indexes on file.id & file.product_collection_id.

Etc. But we need it the other way round: id last. The order of index expressions is crucial.

Rationale: With only 87 rows from product_collection, we only fetch a maximum of 87 x 100 = 8700 rows (fewer if not every pc.id has 100 rows in table file), which are then sorted before picking the top 100. Performance degrades with the number of rows you get from product_collection and with bigger LIMIT.

With the multicolumn index idx1 above, that's 87 fast index scans. The rest is not very expensive.

More optimization is possible, depending on additional information.

Erwin Brandstetter