
The table in question looks similar to this:

create table metrics_measurement
(
    id                       bigint generated by default as identity
        constraint metrics_measurement_pkey
            primary key,
    measured_at              timestamp with time zone not null,
    measured_value           double precision         not null,
    metric_id                integer                  not null
        constraint metrics_measurement_metric_id_24fcec65_fk_metrics_metric_id
            references metrics_metric
            deferrable initially deferred,
    processed_into_aggregate boolean                  not null,
    created_at               timestamp with time zone not null,
    updated_at               timestamp with time zone not null
);

The query in question is:

SELECT * 
FROM "metrics_measurement" 
WHERE "metrics_measurement"."metric_id" = 6433 
ORDER BY "metrics_measurement"."measured_at" ASC 
LIMIT 1;
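
The DDL above doesn't show it, but metric_id does have a plain b-tree index (you can see its name in the plans below); presumably it's equivalent to something like:

-- Assumed definition, reconstructed from the index name in the query plans.
create index metrics_measurement_metric_id_24fcec65
    on metrics_measurement (metric_id);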

What I would expect is for Postgres to use the index on metric_id to filter down to the relevant rows and then sort what remains. Instead, this is what it actually does:

Limit  (cost=722051.34..722051.46 rows=1 width=45) (actual time=50814.355..50814.446 rows=1 loops=1)
  ->  Gather Merge  (cost=722051.34..770982.43 rows=419380 width=45) (actual time=50814.354..50814.444 rows=1 loops=1)
        Workers Planned: 2
        Workers Launched: 0
        ->  Sort  (cost=721051.32..721575.54 rows=209690 width=45) (actual time=50814.141..50814.141 rows=1 loops=1)
              Sort Key: measured_at
              Sort Method: top-N heapsort  Memory: 25kB
              ->  Parallel Seq Scan on metrics_measurement  (cost=0.00..720002.87 rows=209690 width=45) (actual time=24.851..50765.952 rows=489909 loops=1)
                    Filter: (metric_id = 6433)
                    Rows Removed by Filter: 48938488
Planning Time: 0.140 ms
Execution Time: 50814.467 ms

If I SET enable_seqscan TO OFF, then it does what I expect and is roughly two orders of magnitude faster:

Limit  (cost=723454.64..723454.75 rows=1 width=45) (actual time=494.140..494.773 rows=1 loops=1)
  ->  Gather Merge  (cost=723454.64..772444.06 rows=419880 width=45) (actual time=494.139..494.771 rows=1 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Sort  (cost=722454.61..722979.46 rows=209940 width=45) (actual time=481.618..481.619 rows=1 loops=3)
              Sort Key: measured_at
              Sort Method: top-N heapsort  Memory: 25kB
              Worker 0:  Sort Method: top-N heapsort  Memory: 25kB
              Worker 1:  Sort Method: top-N heapsort  Memory: 25kB
              ->  Parallel Bitmap Heap Scan on metrics_measurement  (cost=5753.32..721404.91 rows=209940 width=45) (actual time=76.833..467.373 rows=163306 loops=3)
                    Recheck Cond: (metric_id = 6433)
                    Heap Blocks: exact=2308
                    ->  Bitmap Index Scan on metrics_measurement_metric_id_24fcec65  (cost=0.00..5627.35 rows=503855 width=0) (actual time=85.310..85.310 rows=489917 loops=1)
                          Index Cond: (metric_id = 6433)
Planning Time: 0.140 ms
Execution Time: 494.802 ms

I've already tried ANALYZEing the table, as well as running `alter table metrics_measurement alter metric_id set statistics 10000` and then ANALYZEing again, but the planner still refuses to use the index available to it. I've also tried to look at the statistics the planner relies on to see what's wrong, but it's a bit outside my knowledge to know what to look for there.
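
In case it's useful, this is roughly how I've been poking at those statistics; it's just the standard pg_stats view and pg_class, nothing exotic:

-- What the planner believes about metric_id after ANALYZE.
SELECT null_frac, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'metrics_measurement'
  AND attname = 'metric_id';

-- The planner's overall size estimates for the table.
SELECT reltuples, relpages
FROM pg_class
WHERE relname = 'metrics_measurement';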

So, my question: short of running SET enable_seqscan TO OFF in every connection (which I feel would be Bad), what are the next steps in figuring out why Postgres isn't doing the optimal thing here?
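
(For what it's worth, when I do flip the setting to test, I've been scoping it to a single transaction so it can't leak into other connections, roughly like this:)

BEGIN;
-- SET LOCAL only lasts until the end of the transaction.
SET LOCAL enable_seqscan = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM metrics_measurement
WHERE metric_id = 6433
ORDER BY measured_at ASC
LIMIT 1;
ROLLBACK;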

  • Have you run `ANALYZE metrics_measurement`? From the differences between the estimated and actual row counts, it appears that the table's statistics might need to be refreshed. – JohnH Jul 07 '23 at 16:44
  • As an estimate, what percentage of the rows will match the predicate `metric_id = 6433`? Seems like the engine thinks it's a high percentage. – The Impaler Jul 07 '23 at 17:25
  • Create the index: `create index ix1 on metrics_measurement (metric_id, measured_at);` – The Impaler Jul 07 '23 at 17:25
  • @JohnH, yeah, I've run that a couple of times and also done `alter table metrics_measurement alter metric_id set statistics 10000` and then run it again, to no avail (they take a couple of minutes each time; that's a reasonable amount of time for it to take, right?) – Kenny Loveall Jul 07 '23 at 17:57
  • @TheImpaler: This has 493k rows that match the WHERE (although I just need the latest one) out of roughly 51.5m rows, so ~1% or so. I guess that's "high" compared to some of the other items in the table. I checked another metric_id that only had ~900ish rows matching, and that does use the index as expected. Is it possible to tune the matching-percentage threshold at which the planner chooses to do a table scan vs. an index lookup? – Kenny Loveall Jul 07 '23 at 18:04
  • Regarding the compound index, we do plan to do that, but I wanted to understand this first to make sure there weren't deeper issues and that adding the additional index would help. – Kenny Loveall Jul 07 '23 at 18:04
  • Running `ALTER TABLE metrics_measurement ALTER COLUMN metric_id SET (n_distinct = n)`, where `n` is the number of actual distinct values of `metric_id`, might improve the planner's choice. It's possible that the frequency distribution of `metric_id` values caused the estimated number of distinct values to be skewed. – JohnH Jul 07 '23 at 18:28
  • The fact that your first query planned on 2 parallel workers but launched 0 suggests that the server may have been overloaded at the time it was run. – jjanes Jul 07 '23 at 19:35
  • The main difference is probably that the 2nd query found most of the pages it needs already in memory, while the first one did not. Including the buffer counts with `EXPLAIN (ANALYZE, BUFFERS)` might help clarify that, and even better if `track_io_timing` is on. – jjanes Jul 07 '23 at 19:41
  • @JohnH I think the estimates are pretty good. For the Parallel Seq Scan, the estimate is for 1 of 3 parallel processes, while the actual count is for all the rows (since only one process was actually used, that one process saw all the rows), so being off by so much is just an illusion. – jjanes Jul 07 '23 at 19:44
  • So, we went ahead and added the index suggested above, and now the planner is correctly using it for everything and all is well. Also, yes, the server was severely overloaded at the time due to this issue, which might also have been contributing to the weirdness. – Kenny Loveall Jul 07 '23 at 20:03
  • To tune your query, add more restrictions to the WHERE clause. If certain measured_at dates are not relevant, add a condition, since 419,880 rows are too many to scan just to get 1 row in the final result. – Pepe N O Jul 07 '23 at 20:53
  • Also check [use a particular index](https://stackoverflow.com/questions/309786/how-do-i-force-postgres-to-use-a-particular-index) – Pepe N O Jul 07 '23 at 21:01
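
For reference, the index The Impaler suggested in the comments (which, per the last comments, is what fixed this) covers both the WHERE clause and the ORDER BY, so the LIMIT 1 query can stop after the first matching index entry instead of filtering and sorting ~490k rows:

-- Composite index: equality on metric_id first, then the measured_at sort order.
create index ix1 on metrics_measurement (metric_id, measured_at);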

0 Answers