
We've run into a strange issue with Postgres, initially unearthed as we prepared to upgrade from 10.21 to 11+. What we'd initially thought was a performance regression in newer versions of Postgres appears, in fact, to be a performance regression that surfaces once more accurate stats are generated for our database. As it turns out, both last_analyze and last_autoanalyze are NULL for the table in question. The upgrade corrects this, but makes a certain subset of queries much slower.
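For reference, the analyze timestamps can be checked with something like this (a sketch; pg_stat_user_tables and its columns are standard, the table name is ours):

```sql
-- last manual ANALYZE and last autoanalyze for our table
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'events_test';
```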

Background

We're currently running PG 10.21 in production. The table we're interacting with here, events_test, includes two columns that we're testing on: account (varchar(40)) and timestamp (timestamptz). There are indexes on account and timestamp individually, as well as a composite index on account and timestamp.
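For context, the relevant parts of the schema can be sketched as follows (the column list is trimmed; the index names are taken from the EXPLAIN output later in the question):

```sql
CREATE TABLE events_test (
    account     varchar(40),
    "timestamp" timestamptz
    -- other columns omitted
);

CREATE INDEX events_test_account_idx           ON events_test (account);
CREATE INDEX events_test_timestamp_idx         ON events_test ("timestamp");
CREATE INDEX events_test_account_timestamp_idx ON events_test (account, "timestamp");
```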

Our test query is

EXPLAIN (ANALYZE, BUFFERS) 
SELECT * FROM events_test 
WHERE account = '1234' AND TIMESTAMP >= (NOW() AT TIME ZONE 'EST'  - interval '1 month');

We're testing it against a couple different account values — some we expect to return a large quantity of data (tens of thousands of rows for account 1234) and some a low quantity (tens of rows for account 5678).

On our current production DB, this runs quickly enough for our purposes, around 80-100ms for the large response (account 1234) and ~1ms for the small response (account 5678). In both cases, the query planner opts to start with an index scan on the composite index events_test_account_timestamp_idx.

Out of date stats

For the account returning a large number of rows, the query planner vastly underestimates the number of rows to be returned — it estimates ~500 rows, but gets around ~60,000. This is a result of the table stats being out of date on our production DB.

Manually running ANALYZE events_test;, or upgrading to PG11, PG12, or PG14 (which seems to trigger a reanalysis shortly after the upgrade), updates the table stats and brings the row estimate closer to the true value; the new estimate is on the order of ~20,000 rows for the account with more data returned.

Query planner

Unfortunately, this more-accurate estimate seems to result in the query planner running a much less efficient query. For the account with little data, the row estimate doesn't change and the planner continues to use the composite index. For the account with a larger amount of returned data, however, the planner now opts for an index scan on the events_test_timestamp_idx, followed by a filter on the account field, which takes the execution time of the query up to 400-500ms.

The query that expects a small number of rows to be returned continues to use the composite index on all versions of Postgres.

What we've tried

We've dug around on SO looking for similar questions about Postgres bypassing a composite index, but nothing we've found is of use. Also:

Analyze, reindex, stats target

We've tried updating the default_statistics_target from 100 to 1000 (and have also tried it briefly at 10 and 10000), followed by reanalyzing our table and reindexing, and the slower query plan has not changed.
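For reference, the stats-target experiment can be sketched like this (the session-level SET is illustrative; we actually changed the global default, and the target values here are examples):

```sql
-- raise the sample size used by ANALYZE for this session
SET default_statistics_target = 1000;
ANALYZE events_test;
REINDEX TABLE events_test;

-- or per column, which persists and affects only this table
ALTER TABLE events_test ALTER COLUMN account SET STATISTICS 1000;
ANALYZE events_test;
```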

Postgres versions

This issue was initially revealed as we validated against newer versions of Postgres. We've tested after upgrading to PG11, PG12 and PG14, and the slower query planning shows up in all cases. We can also cause it to happen on PG10 by reanalyzing to update our stats, so it doesn't appear that this is a change in planner behaviour in a Postgres update.

Force composite index

We've verified that we can indeed force Postgres to use the composite index by (temporarily) dropping the timestamp index the planner is now preferring with this test command:

BEGIN;

DROP INDEX events_test_timestamp_idx;

EXPLAIN (ANALYZE, BUFFERS) 
SELECT * FROM events_test 
WHERE account = '1234' AND TIMESTAMP >= (NOW() AT TIME ZONE 'EST'  - interval '1 month');

ROLLBACK;

This causes Postgres to revert to the composite index events_test_account_timestamp_idx; however, it now chooses to do a bitmap index scan on this index. This continues to run in under 100ms, and is comparable in performance to the previous plain index scan on the composite index.

Query EXPLAINs

Here is what the query looks like when it's running more efficiently, in production, on PG10:

Index Scan using events_test_account_timestamp_idx on events_test  (cost=0.11..1109.11 rows=579 width=206) (actual time=0.062..116.439 rows=71700 loops=1)
  Index Cond: (((account)::text = '1234'::text) AND ("timestamp" >= (timezone('EST'::text, now()) - '1 mon'::interval)))
  Buffers: shared hit=68619 read=1249
  I/O Timings: read=23.426
Planning time: 0.662 ms
Execution time: 119.339 ms

And here is what it looks like once it has updated statistics, on PG14 (but the output is similar on all versions):

Index Scan using events_test_timestamp_idx on events_test  (cost=0.11..31867.79 rows=18994 width=204) (actual time=0.164..311.960 rows=55583 loops=1)
  Index Cond: ("timestamp" >= ((now() AT TIME ZONE 'EST'::text) - '1 mon'::interval))
  Filter: ((account)::text = '1234'::text)
  Rows Removed by Filter: 462327
  Buffers: shared hit=492867
Planning:
  Buffers: shared hit=144 read=30
  I/O Timings: read=6.160
Planning Time: 7.021 ms
Execution Time: 314.676 ms

And finally, if we force it to use the composite index by temporarily dropping the timestamp index, on PG14 (but again, similar on all versions):

Bitmap Heap Scan on events_test  (cost=187.05..35867.12 rows=18992 width=204) (actual time=11.373..38.937 rows=55582 loops=1)
  Recheck Cond: (((account)::text = '1234'::text) AND ("timestamp" >= ((now() AT TIME ZONE 'EST'::text) - '1 mon'::interval)))
  Heap Blocks: exact=13586
  Buffers: shared hit=13803
  ->  Bitmap Index Scan on events_test_account_timestamp_idx  (cost=0.00..186.10 rows=18992 width=0) (actual time=9.376..9.376 rows=55582 loops=1)
        Index Cond: (((account)::text = '1234'::text) AND ("timestamp" >= ((now() AT TIME ZONE 'EST'::text) - '1 mon'::interval)))
        Buffers: shared hit=217
Planning:
  Buffers: shared hit=5
Planning Time: 0.316 ms
Execution Time: 41.446 ms

Edit: It seems like the core issue here is that, with an accurate estimate, Postgres expects an index and filter on the single index to be faster (31867.79) than using the composite index (35867.12), when in fact the composite is nearly 10x faster in the above case. Which is interesting, given that the last two EXPLAINs above are using the same set of statistics to come up with these predictions.

Sydin
  • On the newer system, what are the expected and actual rows if you run `EXPLAIN ANALYZE SELECT * FROM events_test WHERE account = '1234'` and `EXPLAIN ANALYZE SELECT * FROM events_test WHERE TIMESTAMP >= (NOW() AT TIME ZONE 'EST' - interval '1 month')`, with no LIMITs? – jjanes Oct 11 '22 at 23:28
  • Maybe someone had disabled autovac on this specific table on the old system, specifically to work around this problem. Can you check that? – jjanes Oct 11 '22 at 23:42
  • What is the size of each index? – jjanes Oct 12 '22 at 00:06
  • Could you show `EXPLAIN (ANALYZE, BUFFERS)` output for all three cases? Also, why don't you use the obvious index on `(account, "timestamp", entry_id)`? – Laurenz Albe Oct 12 '22 at 07:24
  • @LaurenzAlbe — good question. The sort/limit on `entry_id` bit snuck in as part of testing but is not representative of a production query. I'll edit the question to include a simplified query that removes that entirely (and update the `EXPLAIN`s to include `BUFFERS`). – Sydin Oct 12 '22 at 18:30
  • @jjanes on the first query (account), expected=1,730,390; actual=1,739,798. On the timestamp query, expected=527,621; actual=518,263. Autovac is enabled, but I'm not entirely confident when it last ran. On the index size: `events_test_account_timestamp_idx` is 1445mb, `events_test_timestamp_idx` is 1029mb, `events_test_account_idx` is 318mb. – Sydin Oct 12 '22 at 18:31
  • On the v10 server, you can look at the pg_stats view for this tablename to see if it ever had stats collected. – jjanes Oct 13 '22 at 00:45
  • @jjanes — yeah, while the values for `last_analyze` and `last_autoanalyze` are `NULL`, the actual analysis stats indicate to me that analysis was last performed about 18 months ago; the histogram for the `timestamp` field has no ranges newer than that. The lack of updated stats is certainly a curiosity. – Sydin Oct 13 '22 at 17:37

3 Answers


I am pretty sure I know what is going on here, but I don't know what to do about it.

The timestamp column is highly correlated with the order of rows in the table (that is, pg_stats.correlation is nearly 1 for this tablename and attname), which I see you confirmed in a comment. The planner therefore assumes the range scan on that segment of the index on (timestamp) will end up reading a small chunk of the table, and doing so more or less sequentially, and so gives a big break on the IO cost for use of that index. However, for the use of the index on (account, timestamp), it does not get this break. The planner only looks at the correlation stats for the first column of the index, and since account is not correlated with row order, it assumes it is going to have to hop all over the entire table doing random reads. This is wrong: once you fix the specific account and set the boundaries on the timestamp column, that chunk of the index is still highly correlated with the row order, so you are still reading the (small) relevant chunk of the table and doing it in sequential order, but the planner never looks far enough to realize this.
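The correlation figure described above can be inspected directly (a sketch; pg_stats is a standard view, the table name is the asker's):

```sql
-- correlation near 1.0 or -1.0 means the column closely tracks physical row order
SELECT tablename, attname, correlation
FROM pg_stats
WHERE tablename = 'events_test'
  AND attname IN ('account', 'timestamp');
```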

So the "wrong" index gets an unfair credit for an IO convenience which is true, but which is also true (but unknown) for the "right" index. But the plan with the wrong index needs to do a lot of filtering to remove the rows with wrong account values. This consumes CPU time, and the planner knows this, but the amount of CPU time burned is thought to be small compared to the amount of IO it incorrectly thinks will be saved.

One thing you could do about this is to set autovacuum_enabled = off for this table. But for this to take effect, you would have to recreate the table, and then populate it after the change is made. I suspect the equivalent of this is what someone somehow managed to do on your old server. Another thing you could consider is partitioning. If you only had to consider the most recent partition (or two), the correlation effect would be much smaller, as the planner would already know it is going to look at a much smaller table regardless of which index is used.
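A sketch of the per-table setting (note this disables autovacuum going forward but does not remove statistics that already exist, which is why recreating the table comes into it):

```sql
-- disable autovacuum (and hence autoanalyze) for this table only
ALTER TABLE events_test SET (autovacuum_enabled = off);
```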

Another thing you could do, although quite grotty, is to manually monkey around with the statistics table. In my hands this did the job:

update pg_statistic set stanumbers2='{0}' where starelid=25929 and staattnum=2;

But of course in your system that relid would be different than mine, and the attnum probably as well, and I'm not sure stanumbers2 would be the right field to set on all systems. And this would just get overwritten again the next time analyze was done. And of course it might interfere with other plans the planner is trying to come up with, causing it to make choices there that are far worse than the benefit you might get here.

jjanes
  • Thank you, this is super useful and makes a lot of sense. If the planner is using the first index column to make this determination, does it stand to reason that recreating the index with `timestamp` as the first indexed column would change the equation here? I don't know my own historical reasons for choosing the current index order, but I suspect it was just an arbitrary choice. The suggestions on partitioning and stats are also definitely worth a look — I'll dig into those more as well. – Sydin Oct 13 '22 at 17:29
  • A quick update: creating a second composite index on `(timestamp, account)` instead causes the planner to prefer that ordered index over either the single index *or* the account-first composite index, and it improves the query speed as a result. (!) This is huge, I can't thank you enough for your help with this. – Sydin Oct 13 '22 at 20:17
  • Interestingly, if forced to choose between only the composite indexes, the planner obviously chooses the timestamp-first one — predicting a cost of 23,573 vs a predicted cost of the account-first index at 39,459, but in practice, the account-first index is a faster operation: around 50ms vs 100ms for the same query on the preferred timestamp-first index. (I see in the PG documentation that for multi-column b-tree indexes you get efficiency gains if you're doing an equality constraint on the leading column, which may explain this.) – Sydin Oct 13 '22 at 20:21
  • The index on `(account, timestamp)` would be ideal if only the planner would use it. It can use boundaries to eliminate all the disqualifying values. The one on `(timestamp, account)` can only set boundaries to eliminate the disqualifying timestamp values; it then has to individually test each account value to see if it passes. It can do this assessment in the index rather than in the table, but given the high correlation I would not expect this to make a very big difference. – jjanes Oct 14 '22 at 17:16
  • Yeah, the `account`-first index remains faster (although it's not a massive difference), but the `timestamp`-first index is fast enough that it gets the planner to pick it over the single index, which is still a significant speed-up. While this isn't perfect, I think it's better than the alternative. Thanks again for all your help here, I'm going to mark this as solved, as it's how I arrived at the ultimate cause and work-around. – Sydin Oct 17 '22 at 19:03
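The work-around arrived at in the comments above can be sketched as (the index name is assumed):

```sql
-- timestamp-first composite index; the planner prefers this over the
-- single-column timestamp index, avoiding the expensive account filter
CREATE INDEX events_test_timestamp_account_idx
    ON events_test ("timestamp", account);
ANALYZE events_test;  -- optional: refresh stats before re-checking the plan
```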

Sometimes, when selecting a small amount of data, or when the data brought back by the selection is much larger than the remaining data, PostgreSQL uses a filter instead of an index and discards the rows that don't qualify. If you want to intervene here and always use an index, try checking the execution plan and timing with enable_seqscan temporarily set to off, and if possible, read this as well, you may need it: Examining Index Usage
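A sketch of that experiment, scoped to a single transaction so the setting doesn't leak into other sessions:

```sql
BEGIN;
SET LOCAL enable_seqscan = off;  -- discourage sequential scans for this transaction only
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM events_test
WHERE account = '1234' AND "timestamp" >= (NOW() AT TIME ZONE 'EST' - interval '1 month');
ROLLBACK;
```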

Ramin Faracov
  • Thanks for the suggestion; unfortunately disabling `enable_seqscan` doesn't change the planning for the query in the question. It still opts to use the `timestamp` index and then filter rather than using the faster composite index. – Sydin Oct 12 '22 at 18:46
  • @Sydin - Are your indexes for the account and timestamp fields separate or composited? – Ramin Faracov Oct 13 '22 at 06:27
  • There exist indexes on `account`, `timestamp` and a composite index on `account` *and* `timestamp`. – Sydin Oct 13 '22 at 17:22

I can't put my finger on the problem, but here a few considerations:

  • Did you create the composite index on (account, "timestamp")? The order matters, and this order would be better.

  • It seems that PostgreSQL expects that the condition on account is not very selective, but it is wrong. Perhaps extended statistics can improve its estimate:

    CREATE STATISTICS events_test_account_timestamp (dependencies, mcv)
       ON account, "timestamp" FROM events_test;
    ANALYZE events_test;
    
  • Did you set random_page_cost the same on both systems?
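The cost parameter in the last point can be compared on both systems with something like (the 1.1 value is illustrative, a common choice for SSD-backed storage):

```sql
SHOW random_page_cost;       -- default is 4.0

-- to experiment in the current session:
SET random_page_cost = 1.1;  -- makes index scans look cheaper to the planner
```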

Laurenz Albe
  • In order: yes, the index is created on those fields in that order. I haven't tried extending the stats, but I'll take a look at that. I also haven't altered `random_page_cost`. I've added an edit to the end of my original post, but I believe it boils down to this: with accurate statistics, Postgres predicts the cost of an index and filter on the single index to be faster (31867.79) than using the composite index (35867.12), when in fact the composite is nearly 10x faster in the above case (41ms for the composite index, 314ms for the single index and filter). – Sydin Oct 12 '22 at 21:24
  • One relevant detail may be that the data in the table, while not strictly sorted by `timestamp`, is almost certainly going to be almost entirely in `timestamp` order (with the exception of rare race conditions at insertion time). I'm unsure whether being sorted-ish on that column would contribute to the query time performing differently to the prediction, and whether the stats would reflect that. – Sydin Oct 12 '22 at 21:33
  • 1
    There are no updates or deletes on the table? The fact that you get the correct plan with bad statistics is two mistakes canceling each other out. That's not something on the stability of which I would want to rely. – Laurenz Albe Oct 13 '22 at 08:49
  • Yeah, that's my conclusion too — with out of date statistics, the planner accidentally chooses a faster plan, but the core issue here is that the planner predicts a slower plan is faster for larger data lookups. – Sydin Oct 13 '22 at 17:23
  • (And given that upgrading from 10 to any new version appears to update the out-of-date stats, I soon won't be able to rely on it in any case.) – Sydin Oct 13 '22 at 17:39