
We have a table of 180m rows, 20 GB in size. Table DDL is:

create table app.table
(
    a_id    integer   not null,
    b_id    integer   not null,
    c_id    integer   not null,
    d_id    integer   not null,
    e_id    integer   not null,
    f_id    integer   not null,
    a_date  timestamp not null,
    date_added          timestamp,
    last_date_modified  timestamp default now()
);

Value distributions:

  • a_id has a range of 0-160,000,000
  • b_id has one value (this table is a copy of a single partition of a partitioned table, and this ID happened to be the partition key)
  • c_id has a range of 0-4
  • d_id has one value (currently)
  • e_id has one value (currently)

The primary key is a composite key:

alter table app.table add constraint table_pk primary key (a_id, b_id, c_id, d_id, e_id);

We're running an r6g.xlarge cluster on Aurora PostgreSQL v12.8. It's a single instance with no other traffic hitting it. We've run ANALYZE and VACUUM ANALYZE against the table:

INFO:  "table": scanned 30000 of 1711284 pages, containing 3210000 live
 rows and 0 dead rows; 30000 rows in sample, 183107388 estimated total rows

Problem

This query takes 9 seconds to run when shared_buffers is cold (or as cold as we can get it):

select a_id, b_id, c_id, d_id, a_date
from app.table ts
where a_id in ( <5000 values> )
and b_id = 34
and c_id in (2,3)
and d_id = 0

EXPLAIN output:

Index Scan using table_pk on table ts  (cost=0.57..419134.91 rows=237802 width=24) (actual time=8.335..9803.424 rows=5726 loops=1)
  Index Cond: ((a_id = ANY ('{66986803,90478329,...,121697593}'::integer[])) AND (b_id = 34))
  Filter: (c_id = ANY ('{2,3}'::integer[]))
  Rows Removed by Filter: 3
  Buffers: shared hit=12610 read=10593
  I/O Timings: read=9706.055
Planning:
  Buffers: shared hit=112 read=29
  I/O Timings: read=29.227
Planning Time: 33.437 ms
Execution Time: 9806.271 ms

We think this is unreasonably slow. When the query is run again, and thus comes from cache, it takes 25 ms. We'd rather not prewarm if possible.

In any case, we'd rather have better performance for this sort of query, around the 1-2 second mark if possible. Any ideas on how we could improve the performance?


EDIT - Effect of adding a covering index:

Tried adding a covering index to include the "a_date":

create unique index covering_idx on app.table (a_id, b_id, c_id, d_id, e_id) include (a_date)

EXPLAIN results after re-running the query (with cold shared_buffers cache):

Index Only Scan using covering_idx on table ts (cost=0.57..28438.58 rows=169286 width=24) (actual time=8.020..7028.442 rows=5658 loops=1)
  Index Cond: ((a_id = ANY ('{134952505,150112033,…,42959574}'::integer[])) AND (b_id = 34))
  Filter: ((e_id = ANY ('{0,0}'::integer[])) AND (c_id = ANY ('{2,3}'::integer[])))
  Rows Removed by Filter: 2
  Heap Fetches: 0
  Buffers: shared hit=12353 read=7733
  I/O Timings: read=6955.935
Planning:
  Buffers: shared hit=80 read=8
  I/O Timings: read=8.458
Planning Time: 11.930 ms
Execution Time: 7031.054 ms

Effect when using Bitmap Heap Scan vs. Index Scan:

We've discovered that we get a speed up when the query is executed using a Bitmap Heap Scan, rather than an Index Scan. We found this by forcing the plan using pg_hint_plan:

When adding /*+ BitmapScan(table) */:

Bitmap Heap Scan on table ts (cost=22912.96..60160.79 rows=9842 width=24) (actual time=3972.237..4063.417 rows=5657 loops=1)
  Recheck Cond: ((a_id = ANY ('{24933126,19612702,27100661,73628268,...,150482461}'::integer[])) AND (b_id = 34))
  Filter: ((d_id = ANY ('{0,0}'::integer[])) AND (c_id = ANY ('{2,3}'::integer[])))
 Rows Removed by Filter: 4
  Heap Blocks: exact=5644
  Buffers: shared hit=14526 read=11136
  I/O Timings: read=22507.527
  ->  Bitmap Index Scan on table_pk (cost=0.00..22898.00 rows=9842 width=0) (actual time=3969.920..3969.920 rows=5661 loops=1)
       Index Cond: ((a_id = ANY ('{24933126,19612702,27100661,...,150482461}'::integer[])) AND (b_id = 34))
       Buffers: shared hit=14505 read=5513
       I/O Timings: read=3923.878
Planning:
  Buffers: shared hit=6718
Planning Time: 21.493 ms
Execution Time: 4066.582 ms

Currently, we are thinking of forcing this plan in production using pg_hint_plan - but we'd rather know why the planner is opting for a less optimal plan! We have run VACUUM ANALYZE with default_statistics_target of 1000.

    It seems to be just the IO spent on fetching the records, since it is using the index. Have you considered partitioning this table? – Vinicius Aug 12 '22 at 10:06
  • I just realised that this is a copy from a partition from another table :P Yet, a 20GB table seems to be a candidate for further partitioning. – Vinicius Aug 12 '22 at 10:09
  • We could partition it further, but this would only mean we would end up querying across partitions. As I understand it, partitioning should aim to allow you to hit as few partitions as possible, which this would violate. – Robert Hargreaves Aug 12 '22 at 10:15
  • It all depends on the partitioning key ranges... Without knowing the full use case it's hard to say. – Vinicius Aug 12 '22 at 11:01
  • We're currently partitioned on "b_id" - where we have 150 odd partitions. We could also partition on "b_id & c_id", which would likely split the existing partitions into thirds (there's 3 or so possible values for c_id). But we would have queries which need to query for all 3 values of c_id, so the query would end up hitting at least 3 partitions in that case. – Robert Hargreaves Aug 12 '22 at 11:07
  • I see. I'd try creating a [covering index](https://www.postgresql.org/docs/14/indexes-index-only-scans.html), maybe the problem here is the random access of the heap pages. – Vinicius Aug 12 '22 at 11:47
  • Actually we did try that, but it didn't make a great deal of difference, even with zero heap fetches after a VACUUM. – Robert Hargreaves Aug 12 '22 at 14:09
  • @RobertHargreaves Please show the plan for the one using the covering index. Even if it is not awesome, it is still possibly valuable information. – jjanes Aug 12 '22 at 16:21
  • Why don't you want to prewarm? – jjanes Aug 12 '22 at 16:23
  • @jjanes I've now added covering index results in the original question. The difficulty with prewarming is that we can't easily predict the queries that the DB would be hit with. Whilst we appreciate that there might be some slow queries whilst the cache is being built up, we'd like the query to at least perform semi-acceptable (i.e. response time under a couple of seconds) whilst that cache is building up. – Robert Hargreaves Aug 12 '22 at 20:16
  • Try rearranging the order of the index and modifying the query to match the index order. As columns b, d & e each have one value, the index order should be like this: index (b, d, e, c, a). – Rahul Biswas Aug 14 '22 at 03:48
  • @RahulBiswas surely the order of terms in the WHERE clause cannot affect the execution plan...? As for the index itself, why drop a_id from the index? That's the most selective term? – Robert Hargreaves Aug 15 '22 at 08:45
  • I've also found that a Bitmap Index Scan is more efficient (about a 40% speed up) but we don't know why it is, nor why the planner isn't opting for this in the first place (results + explain added to question body above) – Robert Hargreaves Aug 15 '22 at 12:02
  • Try rewriting your `IN` clause with `VALUES` like that: `where a_id in (values (24933126), (19612702), (27100661), (73628268))` and see if it helps – Evgeniy Chekan Aug 16 '22 at 10:30
  • After doing some research on the matter I stumbled upon this article https://www.datadoghq.com/blog/100x-faster-postgres-performance-by-changing-1-line/, there is a high probability your problem is identical to the one in it. – Dominik Klug Aug 22 '22 at 15:02

3 Answers


You are trying to optimize query performance on cold cache.

It's one instance with no other traffic hitting it. We've ran ANALYZE and VACUUM ANALYZE against the table

(Aside, ANALYZE alone adds nothing over VACUUM ANALYZE, so that's redundant.)

To optimize, minimize the number of data pages that have to be read. So ...

  1. ...decrease the storage size per row if possible. (With index-only scans, that's mostly just important for the involved index.)

  2. ... increase data locality: more qualifying tuples in the same data page means fewer pages to read.

Just reorder PK columns

You should get some improvement from simply re-ordering columns in your PK. You now have:

primary key (a_id, b_id, c_id, d_id, e_id)

With leading a_id, index tuples for distinct a_id values are spread out as much as possible. That is exactly what your query does not need. You disclosed:

b_id has one value [...]
d_id has one value (currently)
e_id has one value (currently)
c_id has a range of 0-4
a_id has a range of 0-160,000,000

Reorder columns like this to maximize locality for your query:

ALTER TABLE app.table ADD CONSTRAINT table_pk PRIMARY KEY (b_id, d_id, e_id, c_id, a_id) INCLUDE (a_date);

(Plus, add INCLUDE (a_date) to allow index-only scans, as has been suggested already.)

Since b_id, and d_id / e_id (currently), are constants, they are just noise / ballast. The important part is to move c_id before a_id; this way, we never touch branches of the index with c_id IN (0,1,4), and more of our tuples end up on fewer index pages. It's a mild effect, since we seem to use about half of the spectrum anyway.
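Since the constraint table_pk already exists, the reordered key cannot simply be added on top of it. A minimal sketch of one way to swap it in (the temporary index name table_pk_new is illustrative):

```sql
-- Build the replacement index first; CONCURRENTLY avoids blocking writes.
CREATE UNIQUE INDEX CONCURRENTLY table_pk_new
    ON app.table (b_id, d_id, e_id, c_id, a_id) INCLUDE (a_date);

-- Swap the constraint over in a short transaction.
BEGIN;
ALTER TABLE app.table DROP CONSTRAINT table_pk;
ALTER TABLE app.table ADD CONSTRAINT table_pk
    PRIMARY KEY USING INDEX table_pk_new;  -- index is renamed to table_pk
COMMIT;
```

Because the new index is built online first, the table is only locked for the brief constraint swap at the end.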

More radical

Since b_id is a constant, it shouldn't water down the PK to begin with. The same is true for d_id and e_id if they actually remain constants.

And we don't need e_id for our query at all.

This adapted query:

SELECT a_id, 34 AS b_id, c_id, 0 AS d_id, a_date
FROM   app.table ts
WHERE  c_id IN (2,3)
AND    a_id IN ( < 5000 VALUES > )

.. in combination with this index would be much better:

CREATE INDEX foo ON app.table (c_id, a_id) INCLUDE (a_date)

Probably better, yet:

SELECT a_id, 34 AS b_id, 2 AS c_id, 0 AS d_id, a_date
FROM   app.table ts
WHERE  c_id = 2
AND    a_id IN ( < 5000 VALUES > )
UNION ALL
SELECT a_id, 34 AS b_id, 3 AS c_id, 0 AS d_id, a_date
FROM   app.table ts
WHERE  c_id = 3
AND    a_id IN ( < 5000 VALUES > )

This allows index-only scans with only index conditions (Index Cond: in the query plan) and no filter (Filter:), for minimum cost.

Or even partial indexes for the last query:

CREATE INDEX foo_c2 ON app.table (a_id) INCLUDE (a_date) WHERE c_id = 2;
CREATE INDEX foo_c3 ON app.table (a_id) INCLUDE (a_date) WHERE c_id = 3;

This eliminates irrelevant rows from the tailored index(es) and makes index access slightly cheaper still.
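To confirm a partial index is actually chosen, check the plan for the matching branch (the short a_id list here is just a stand-in for the real 5000 values):

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT a_id, a_date
FROM   app.table
WHERE  c_id = 2                       -- must match the index predicate exactly
AND    a_id IN (24933126, 19612702);  -- stand-in for the real list
```

The plan should show an Index Only Scan on the partial index with only Index Cond: lines; a remaining Filter: on c_id would mean the partial index predicate was not matched.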

Also, consider upgrading to Postgres 13 or Postgres 14 - with improvements for big data.

Consider the bottom part of the manual page "Index-Only Scans and Covering Indexes" for this!

Erwin Brandstetter

This question might be pretty specific to Aurora, on which I don't have much experience.

Your index-only scan results are a bit surprising. I would not think it should take 7733 buffer reads to obtain 5658 rows (plus 2 filtered out and 0 heap fetches). I would not expect it to need more than ~5700 reads. But I understand that Aurora's storage layer is pretty different from community PostgreSQL, so maybe that has something to do with it. In any case that is only a reduction of 25%, not the 10-fold you are looking for.

EDIT: I realized those extra reads are of internal index pages. I had rejected this idea at first, because 2075 internal pages to 5658 leaf pages is a ridiculous ratio. But then I realized that the leaf pages read by that one query are a tiny fraction of all the leaf pages that exist, while the internal pages read are probably the bulk of all the internal pages that exist.

This is probably a flaw in your testing method. To avoid caching the data unfairly, it would be enough to randomly pick a different 5000 a_id values each time. Restarting the whole database (or whatever method you used to clear the cache) is overkill. If it is not overkill because you really are restarting your production database between every query, well, stop doing that.

The read times of about 1 ms per read seem rather slow for something using a good SSD layer (my own crappy one does that well), but I can't find any good data about what to expect from Aurora's storage layer.

I am also curious about the row estimates being off by 30 to 50 fold. Why is that? It just shouldn't be that hard to come up with a more accurate estimate here. I wouldn't think a different plan would be any faster, so the estimate really shouldn't matter, but you never know where a mystery will lead you. What if you keep just the a_id IN-list and drop the rest of the column conditions?

EDIT: I think I realized the answer to this: the PostgreSQL sampling method used to compute pg_stats.n_distinct is subtly biased in a way that can greatly underestimate n_distinct for a very large table that is clustered on the column being sampled (a_id here), and n_distinct is very important to the selectivity estimate. Fortunately you can manually override this estimate using alter table app."table" alter a_id set (n_distinct = 9999999);. But again, that is not going to do much for you here, because there is no better plan to be had. It might be important for other queries, though.
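For reference, the estimate in question can be inspected and overridden like this (the 160000000 figure is illustrative, based on the stated 0-160,000,000 range of a_id; the override only takes effect at the next ANALYZE):

```sql
-- What the planner currently believes about a_id:
SELECT attname, n_distinct, correlation
FROM   pg_stats
WHERE  schemaname = 'app' AND tablename = 'table' AND attname = 'a_id';

-- Pin n_distinct manually, then refresh statistics:
ALTER TABLE app."table" ALTER COLUMN a_id SET (n_distinct = 160000000);
ANALYZE app."table";
```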

But I think your best course is to take a step back. Why are you running this query? What is the "business case" for it? Where is the list of 5000 ids coming from? Is there some pattern to them?

jjanes
  • "I am also curious about the row estimates being off by 30 to 50 fold. Why is that?" - I'm not sure. This confuses me too. Even when I `ANALYZE` the table with `default_statistics_target` set to `1000`, it still thinks it's going to pull back the same amount of rows. – Robert Hargreaves Aug 15 '22 at 08:45
  • As for dropping of conditions - interestingly we found the speed is very much the same with those conditions removed (i.e. when only `a_id` and `b_id` are present). We think we could just retrieve more data and cache as much as possible in the API layer. If the DB layer is going to be slow, then we might have to work around it. However, we're still curious as to why it is slow, because it seems overly-slow and we are still concerned about cold queries. – Robert Hargreaves Aug 15 '22 at 12:05
  • @RobertHargreaves Why it is so slow seems pretty simple. You are jumping to >5000 random spots in the index, which generates >5000 random IO; and random IO is slow. I don't see how an API cache is going to help here, unless there is some regularity you haven't shown us. If you don't have enough RAM to cache what you need, why is spreading the same RAM over two mostly redundant caches going to make things better? And won't the API cache still suffer from cold queries? – jjanes Aug 15 '22 at 16:23
  • @RobertHargreaves I edited my answer to add some realizations I came to after writing the first answer. They don't solve your problem, just explain it more fully. – jjanes Aug 15 '22 at 16:25
  • thanks for adding those edits - they've been very helpful! We only rebooted the DB to simulate cold caches - we're not actually doing that in production :) – Robert Hargreaves Aug 19 '22 at 16:25

For your query:

select a_id, b_id, c_id, d_id, a_date
from app.table ts
where a_id in ( <5000 values> ) -- many values
and b_id = 34                   -- 1 value
and c_id in (2,3)               -- up to 4 values
and d_id = 0                    -- 1 value
-- e_id is not referenced, so it does not need to be in the index (1 value)

You need an index whose columns are ordered from the fewest distinct values to the most:

CREATE INDEX INDX_TABLE_1 ON app.table (b_id, d_id, c_id, a_id) INCLUDE (a_date)

That should be enough.

Also, if you want, you can compare performance with this index; sometimes PostgreSQL handles the IN keyword differently:

CREATE INDEX INDX_TABLE_2 ON app.table (b_id, d_id, c_id) INCLUDE (a_id, a_date)

Please share your test results.

Also, if your b_id, c_id, d_id, ... columns have small value sets and appear in all queries (like x_id = 1), you can use partitioned tables for better performance.

utrucceh
  • I mentioned in a comment on the original question that this is already partitioned data, with further partitioning not likely to help because it would require multiple partitions to be queried to return a complete result set. – Robert Hargreaves Aug 22 '22 at 13:17
  • I don't understand how moving `a_id` to an INCLUDE instead of the index key would help? – Robert Hargreaves Aug 22 '22 at 13:17
  • Sometimes PostgreSQL chooses a full table scan rather than an index scan when the index size is close to the table size. This doesn't happen every time. But when data is in INCLUDE columns, PostgreSQL calculates the size as if there were no included data when choosing a plan. So I just say 'try' it – utrucceh Aug 23 '22 at 11:13