
The Problem

We have a relational table where we store user activity. A query like the following takes 77 seconds (and the EXPLAIN ANALYZE run shown below took even longer, over three minutes)!

FROM "site_activity"
WHERE
    (
        NOT "site_activity"."is_deleted"
        AND "site_activity"."user_id" = 68812389
        AND NOT (
            "site_activity"."kind" IN (
                'updated',
                'duplicated',
                'reapplied'
            )
        )
        AND NOT (
            "site_activity"."content_type_id" = 14
            AND "site_activity"."kind" = 'created'
        )
    )
ORDER BY
    "site_activity"."created_at" DESC,
    "site_activity"."id" DESC
LIMIT  9;

The query plan looks like this:

                                     QUERY PLAN
--------------------------------------------------------------------------------------------
Limit
    (cost=17750.72..27225.75 rows=9 width=16)
    (actual time=199501.336..199501.338 rows=9 loops=1)
  Output: id, created_at
  Buffers: shared hit=4502362 read=693523 written=37273
  I/O Timings: read=190288.205 write=446.870
  ->  Incremental Sort
      (cost=17750.72..2003433582.97 rows=1902974 width=16)
      (actual time=199501.335..199501.336 rows=9 loops=1)
        Output: id, created_at
        Sort Key: site_activity.created_at DESC, site_activity.id DESC
        Presorted Key: site_activity.created_at
        Full-sort Groups: 1  Sort Method: quicksort  Average Memory: 25kB  Peak Memory: 25kB
        Buffers: shared hit=4502362 read=693523 written=37273
        I/O Timings: read=190288.205 write=446.870
        ->  Index Scan Backward using site_activity_created_at_company_id_idx on public.site_activity
            (cost=0.58..2003345645.30 rows=1902974 width=16)
            (actual time=198971.283..199501.285 rows=10 loops=1)
              Output: id, created_at
              Filter: (
                (NOT site_activity.is_deleted) AND (site_activity.user_id = 68812389)
                AND ((site_activity.kind)::text <> ALL ('{updated,duplicated,reapplied}'::text[]))
                AND ((site_activity.content_type_id <> 14) OR ((site_activity.kind)::text <> 'created'::text))
              )
              Rows Removed by Filter: 14735308
              Buffers: shared hit=4502353 read=693523 written=37273
              I/O Timings: read=190288.205 write=446.870
Settings: effective_cache_size = '261200880kB',
          effective_io_concurrency = '400',
          jit = 'off',
          max_parallel_workers = '24',
          random_page_cost = '1.5',
          work_mem = '64MB'
Planning:
  Buffers: shared hit=344
Planning Time: 6.429 ms
Execution Time: 199501.365 ms
(22 rows)

Time: 199691.997 ms (03:19.692)

Table Facts

  1. It contains a little more than 4 billion rows.

  2. The table structure is

                                                Table "public.site_activity"
        Column      |           Type           | Collation | Nullable |                   Default
    ----------------+--------------------------+-----------+----------+----------------------------------------------
    id              | bigint                   |           | not null | nextval('site_activity_id_seq'::regclass)
    created_at      | timestamp with time zone |           | not null |
    modified_at     | timestamp with time zone |           | not null |
    is_deleted      | boolean                  |           | not null |
    object_id       | bigint                   |           | not null |
    kind            | character varying(32)    |           | not null |
    context         | text                     |           | not null |
    company_id      | integer                  |           | not null |
    content_type_id | integer                  |           | not null |
    user_id         | integer                  |           |          |
    Indexes:
        "site_activity_pkey" PRIMARY KEY, btree (id)
        "site_activity_modified_at_idx" btree (modified_at)
        "site_activity_company_id_idx" btree (company_id)
        "site_activity_created_at_company_id_idx" btree (created_at, company_id)
        "site_activity_object_id_idx" btree (object_id)
        "site_activity_content_type_id_idx" btree (content_type_id)
        "site_activity_kind_idx" btree (kind)
        "site_activity_kind_idx1" btree (kind varchar_pattern_ops)
        "site_activity_user_id_idx" btree (user_id)
    Foreign-key constraints:
        "site_activity_company_id_fk_site_company_id" FOREIGN KEY (company_id)
            REFERENCES site_company(id) DEFERRABLE INITIALLY DEFERRED
        "site_activity_content_type_id_fk_django_co" FOREIGN KEY (content_type_id)
            REFERENCES django_content_type(id) DEFERRABLE INITIALLY DEFERRED
        "site_activity_user_id_fk_site_user_id" FOREIGN KEY (user_id)
            REFERENCES site_user(id) DEFERRABLE INITIALLY DEFERRED
    

    a. kind is stored as varchar in the database, but the application (Python) treats it as an enum, so the set of values is fixed. There are around 100 distinct values.

    b. content_type_id has around 80 distinct values.

  3. This is the distribution of values (see the measurement sketch after this list):

    a. context is actually JSON, with a max size of 8 MB.

    b. Three content_type_id values hold 92% of the rows; among them are 14 and 19.

    c. Three kind values account for 75% of the rows: created, updated, and sent.

    d. The combination of kind and content_type_id yields around 460 distinct pairs. One of those pairs alone contains 35% of the rows, and we always exclude it in this query (content_type_id = 14 with kind = 'created').

  4. The replica instance is of type db.r5.12xlarge: 24 cores / 48 vCPUs, 384 GB memory, io1 storage.
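
(For reference, the distribution figures in fact 3 can be reproduced with an aggregate of this shape. The query is illustrative and expensive at this scale, so it is best run on a replica:)

SELECT content_type_id, kind, count(*) AS n
FROM   site_activity
GROUP  BY content_type_id, kind
ORDER  BY n DESC
LIMIT  10;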

Question

  1. How do we handle this if the table grows to 100 billion rows? In the current projection, this can happen within the next 3-5 years.
  2. Is NoSQL a good solution? Note that we are not accessing the documents by only id or kind.

Notes

  1. The facts that I presented might bias the solution toward replication on the same host, and later sharding over multiple hosts. But if there is some other solution that can keep up to the 100-billion mark, we should be good.
  2. We don't have to use AWS, but it is preferred.
Shiplu Mokaddim
  • Performance will be directly related to the hardware specification / CPUs, the ability of the query to go parallel, and how you tune queries / index the table / partition the data – Stu Feb 23 '23 at 11:35
  • You can consider in-memory databases like ClickHouse. Though not a relational database, it is compatible with Postgres – mrrobot.viewsource Feb 23 '23 at 11:38
  • Posting the explain plan will garner more direct responses in terms of tuning that query. – VynlJunkie Feb 23 '23 at 11:55
  • Could you please share the results from EXPLAIN(ANALYZE, VERBOSE, BUFFERS, SETTINGS) for your SQL statements? (in plain text, as an update of your question) – Frank Heikens Feb 23 '23 at 12:08
  • @FrankHeikens I have added the explain you asked for! – Shiplu Mokaddim Feb 23 '23 at 12:21
  • @jarmod Look carefully. There is a `site_activity_user_id_idx` index – Shiplu Mokaddim Feb 23 '23 at 12:47
  • @ShipluMokaddim - so the question you should be asking is "why isn't this query using that index?" There are a couple of reasons that could happen. First, it might be that `user_id` isn't very selective: if you have only a dozen users, then there's minimal benefit to the index. Another explanation is that your table statistics aren't very good. This part of the documentation might help you: https://www.postgresql.org/docs/12/planner-stats.html – kdgregory Feb 23 '23 at 13:26
  • As a general comment, this appears to be a "business intelligence" style of query, and looking at all of the indexes on that table, it appears that the table is primarily used for similar queries. As such, I'd think about offloading the table into S3 and using Athena to make such queries. It may or may not improve query performance (although it probably will), but more importantly, such queries put an undesirable load on an OLTP database. – kdgregory Feb 23 '23 at 13:33
  • @kdgregory It's the user activity that a site owner/admin can see in different pages of their site. We have around 100k such sites. Our internal BI team exports data from here to their own data warehouse. – Shiplu Mokaddim Feb 23 '23 at 14:08
  • Isn't the predicate `AND "site_activity"."kind" = 'created'` redundant? If that's the case, remove it. – The Impaler Feb 23 '23 at 14:20
  • @ShipluMokaddim - OK, but my other comment was the important one. – kdgregory Feb 23 '23 at 17:12
  • You show `kind character varying(32)`, but then you state: `kind is actually an enum`. Makes quite a difference. Then your index `site_activity_kind_idx1` makes no sense. Then again, your query plan shows `site_activity.kind)::text <> 'created'::text)` Please fix your question. A 4-byte `enum` is much cheaper in a multicolumn index than `varchar(32)`. Also: which of the filters in this query are immutable, and which can change? How selective is each of them? And `user_id` can be NULL? Also add your `SELECT` list, it's relevant. – Erwin Brandstetter Feb 28 '23 at 00:53
  • You say `2 combination contains 65% of rows and we exclude them in the query all time`. But the query only excludes *one* combination. You mention: `3 content_type_id values holds 92% of the rows`, but you don't say which. You mention `3 kind consumes 75% rows` Again, which ones? And `user_id` does not seem to be selective. How many distinct values in `user_id`? How many rows for the given one? – Erwin Brandstetter Feb 28 '23 at 01:26
  • @ErwinBrandstetter kind is varchar in db, but in the app it's treated as enum. I have updated the explanation in the question. `"site_activity"."content_type_id" = 14 AND "site_activity"."kind" = 'created'` and `"site_activity"."is_deleted"` are immutable. Others can change – Shiplu Mokaddim Mar 04 '23 at 10:56
  • @ErwinBrandstetter I have updated with those specific content_type_id values and kind names. Corrected the claim that 65% of the rows are excluded: it's only 1 combination, which excludes 35% of rows. – Shiplu Mokaddim Mar 04 '23 at 11:07

2 Answers


The current plan is to scan the rows already ordered by "created_at" (using an index) and then stop once it has found the 9 requested rows (plus maybe a few extra to account for the sort) passing the rest of the conditions. It thinks it will do this very quickly, after reading only about 1/73,000 of the table (27225.75 / 2003433582.97). But in fact it had to scan much more than that (14735308 / 4000000000, or about 1/270 of the table), so it grossly misestimated that part. I don't know if it misestimated because the number of rows meeting the conditions was estimated incorrectly (it thought there would be 1902974; we don't know how many there actually were, since it stopped early and so stopped counting them), or because it assumed the matching rows would be evenly dispersed over the index, when they were not.

The best index for you will probably be on (user_id, created_at). That way you get to jump to just the part of the index which has the correct user_id (which I assume is where the vast majority of your selectivity comes from) and then walk that part already in order by "created_at". And you can drop the original index on just (user_id), as the new one will be good for anything the old one was good for. You could also add "is_deleted" between the other two columns in that index, as it will not spoil the ordering property and will provide some additional selectivity (but probably not much). Any other columns added there would spoil the ordering property, however.
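
A minimal sketch of that suggestion (the index name is illustrative; CONCURRENTLY avoids blocking writes while the index builds on a table this large):

CREATE INDEX CONCURRENTLY site_activity_user_id_created_at_idx
    ON public.site_activity (user_id, created_at);
-- Variant with the extra (small) selectivity mentioned above; a boolean
-- between the equality column and the sort column does not spoil the ordering:
-- CREATE INDEX CONCURRENTLY ... ON public.site_activity (user_id, is_deleted, created_at);

-- The old single-column index is then redundant:
DROP INDEX CONCURRENTLY site_activity_user_id_idx;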

jjanes
  • Will creating such an index allow me to handle 100 billion records? That was my question. – Shiplu Mokaddim Feb 24 '23 at 16:44
  • When you have as many records as 3 billion, indexing alone will not give you what you want. You will want to explore sharding and partitioning. Again, there are no silver bullets, but MongoDB is also an option, although your disk requirements would be on the higher side. Using an SSD is highly recommended for faster I/O, whichever way you choose. – Olamide226 Mar 04 '23 at 15:30
  • @Olamide226 Indexes are very efficient, even on large tables. I'd say your comment is misleading and opinionated. Partitioning may be interesting, but for other reasons. – Laurenz Albe Mar 06 '23 at 03:03
  • @LaurenzAlbe Yes I agree that indexes are very efficient and I do not downplay that fact in anyway. But on very large tables it is not an automatic silver bullet. You require a combination of indexing and other strategies to achieve optimum performance is what I was driving at. – Olamide226 Mar 07 '23 at 08:58
  • @Olamide226 And I claim that that isn't true, as far as query performance is concerned. – Laurenz Albe Mar 07 '23 at 09:15
  • @LaurenzAlbe I'll give you a real-life scenario: we had a Postgres DB that grew into billions of rows and kept increasing at a rate of ~300 million rows per month. I can tell you for free that querying on an indexed column would take a long time, and partitioning reduced this significantly. In fact, if you index your DB and still compose a wrong query, it's as good as not indexing. So it's misleading to say an index is a silver bullet. You need a well-written query IN COMBINATION with an index, and sometimes even hardware performance, to get optimal performance from your DB – Olamide226 Mar 07 '23 at 09:54

Query

Start by formatting the WHERE clause to make it easier to understand. Comes down to:

SELECT s.id, s.created_at
FROM   site_activity s
WHERE  s.user_id = 68812389
AND    NOT s.is_deleted
AND    s.kind <> ALL ('{updated,duplicated,reapplied}'::text[])
AND    (content_type_id <> 14 OR kind <> 'created')
ORDER  BY s.created_at DESC, s.id DESC
LIMIT  9;

Index

You commented you always exclude rows for these two conditions. So this partial, multicolumn index would be the optimum:

CREATE INDEX ON public.site_activity (user_id, created_at, id)
WHERE  NOT is_deleted
AND    (content_type_id <> 14 OR kind <> 'created');

Adding id only makes sense if there are many rows with the same (user_id, created_at). Else drop id from the index.

Excluding large, irrelevant parts of the table is what lets such a big index pay off. (But it may prevent HOT updates for changes on any of the columns involved in the index, including the ones in the WHERE clause.)

The index can only be used when its filters are an obvious subset of the filters in the query.
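
For illustration (this counter-example is mine, not from the original answer): if a query drops the NOT is_deleted condition, its filters no longer imply the index predicate, so the planner cannot use the partial index and falls back to other plans:

SELECT s.id, s.created_at
FROM   site_activity s
WHERE  s.user_id = 68812389  -- "NOT is_deleted" missing: index predicate not implied
ORDER  BY s.created_at DESC, s.id DESC
LIMIT  9;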

Table definition

It would pay to optimize your table definition. Like:

    Column      |     Type     | Nullable |                   Default
----------------+--------------+----------+----------------------------------------------
id              | bigint       | not null | nextval('site_activity_id_seq'::regclass)
user_id         | int          | not null |  -- NOT NULL ??
kind            | smallint     | not null |  -- "around 100 values"
content_type_id | smallint     | not null |  -- "around 80 values"
created_at      | timestamptz  | not null |
modified_at     | timestamptz  | not null |
object_id       | bigint       | not null |
company_id      | int          | not null |
is_deleted      | bool         | not null |
context         | text         | not null |

Most importantly, kind now occupies 2 bytes instead of 33 bytes or more (smallint instead of varchar(32)).
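
A hedged sketch of that change, assuming the application supplies the complete value-to-number mapping (only three of the ~100 values are shown; the CASE must cover every value, since kind is NOT NULL, and the statement rewrites all 4 billion rows under a lock, so plan it as a major migration):

ALTER TABLE site_activity
  ALTER COLUMN kind TYPE smallint
  USING CASE kind
          WHEN 'created' THEN 1
          WHEN 'updated' THEN 2
          WHEN 'sent'    THEN 3
          -- ... one WHEN per remaining enum value
        END;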

Plus substantial savings from rearranging the column order: placing fixed-width columns first, roughly from largest to smallest, minimizes alignment padding.

The big column context ("with a max size of 8 MB") will typically be stored out-of-line in a TOAST table for most rows, so the tuples to work with shrink to half their size. This makes a difference for most operations.
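
To check how much of the table already lives out-of-line, the standard size functions can be compared; the difference between the two figures approximates the TOAST (and auxiliary fork) footprint:

SELECT pg_size_pretty(pg_table_size('site_activity'))    AS incl_toast,
       pg_size_pretty(pg_relation_size('site_activity')) AS heap_only;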

And I suspect that some of your indexes may be expendable.

Erwin Brandstetter