SELECT DISTINCT is slower than expected on my table in PostgreSQL

Question

Here's my table schema:

CREATE TABLE tickers (
    product_id TEXT NOT NULL,
    trade_id INT NOT NULL,
    sequence BIGINT NOT NULL,
    time TIMESTAMPTZ,
    price NUMERIC NOT NULL,
    side TEXT NOT NULL,
    last_size NUMERIC NOT NULL,
    best_bid NUMERIC NOT NULL,
    best_ask NUMERIC NOT NULL,
    PRIMARY KEY (product_id, trade_id)
);

My application subscribes to Coinbase Pro's websocket on the "ticker" channel and inserts a row into the tickers table whenever it receives a message.

The table has nearly two million rows now.

I assumed that running SELECT DISTINCT product_id FROM tickers would be fast, but it takes around 500 to 600 milliseconds. Here's the output from EXPLAIN ANALYZE:

HashAggregate  (cost=47938.97..47939.38 rows=40 width=8) (actual time=583.105..583.110 rows=40 loops=1)
  Group Key: product_id
  ->  Seq Scan on tickers  (cost=0.00..42990.98 rows=1979198 width=8) (actual time=0.030..195.536 rows=1979243 loops=1)
Planning Time: 0.068 ms
Execution Time: 583.137 ms

If I turn off seq scanning by running SET enable_seqscan = FALSE (not something I want to actually rely on, just doing it for testing purposes) then the query is a little faster. Between 400 and 500 milliseconds. Here's the output from EXPLAIN ANALYZE:

Unique  (cost=0.43..80722.61 rows=40 width=8) (actual time=0.020..480.339 rows=40 loops=1)
  ->  Index Only Scan using tickers_pkey on tickers  (cost=0.43..75772.49 rows=1980051 width=8) (actual time=0.019..344.113 rows=1980160 loops=1)
        Heap Fetches: 328693
Planning Time: 0.064 ms
Execution Time: 480.386 ms

There are only 40 unique product IDs in the table. I assumed that since product_id is part of the composite primary key, and thus indexed, SELECT DISTINCT product_id FROM tickers would be much faster. But as it turns out, the query planner defaults to using a seq scan rather than the index, and even if I force it to use the index it's still slow (but a little faster than seq scan). I realize I could create another table to store nothing but unique product IDs and query that instead, but I'm more concerned with the reason(s) why my query on the tickers table is taking so long.

EDIT #1: I tried creating an index solely on the product_id column (CREATE INDEX idx_tickers_product_id ON tickers (product_id)) and the query planner still does a sequential scan unless I run SET enable_seqscan = FALSE first. But its performance is slightly better (10 to 50 milliseconds faster) than when the composite PK index is used.

EDIT #2: I tried Erwin Brandstetter's solution and it greatly improved the speed. There are now 2.25 million rows in the table and the execution only takes 0.75 milliseconds!

EDIT #3: I wanted to augment the accepted solution in order to retrieve the ticker count (max(trade_id) - min(trade_id) + 1) as well as the min and max time for each product id. I created a new question for this: How to use index skip emulation in PostgreSQL to retrieve distinct product IDs and also min/max for certain columns

I would have expected a full index scan, too, but well, sometimes it's faster just to read the table sequentially instead of finding one's way through an index. An additional index on product_id alone would almost certainly be used. — Thorsten Kettner, Mar 31 '21 at 19:23
This would be more efficient with an access path known as as "index skip scan" in other DBMS, but unfortunately Postgres doesn't have that yet. One way to improve performance would be to use a `group by` instead as that can make use of a parallel scan. — , Mar 31 '21 at 19:23
Thanks @ThorstenKettner. I tried adding an index solely to the product_id column to see what it would do. See "EDIT #1" in the question for details. — Richard Gieg, Mar 31 '21 at 20:38
I know you found a good solution already, but one reason why the index only scan wasn't much faster than the seq scan was because it had to visit the heap 300k times. This is likely why postgres chose the seq scan. Vacuum the table to update the visibility map and an index only scan will be much faster. — Jeremy, Apr 01 '21 at 01:02
Thanks @Jeremy. Is that something I would have to run again as more rows are added to the table? — Richard Gieg, Apr 01 '21 at 15:43

Erwin Brandstetter · Accepted Answer · 2021-03-31T22:57:22.880

19

While there is no index skip scan in Postgres yet, emulate it:

WITH RECURSIVE cte AS (
   (   -- parentheses required
   SELECT product_id
   FROM   tickers
   ORDER  BY 1
   LIMIT  1
   )
   UNION ALL
   SELECT l.*
   FROM   cte c
   CROSS  JOIN LATERAL (
      SELECT product_id
      FROM   tickers t
      WHERE  t.product_id > c.product_id  -- lateral reference
      ORDER  BY 1
      LIMIT  1
      ) l
   )
TABLE  cte;

With an index on (product_id) and only 40 unique product IDs in the table this should be Fast. With capital F.
The PK index on (product_id, trade_id) is good for it, too!

With only very few rows per product_id (the opposite of your data distribution), DISTINCT / DISTINCT ON would be as fast or faster.

Work to implement index skip scans is ongoing.
See:

edited Mar 31 '21 at 22:57

answered Mar 31 '21 at 19:55

Erwin Brandstetter

605,456
145
1,078
1,228

1

This is great! I'm unfamiliar with recursive CTEs and CROSS JOIN LATERAL, so I have some homework to do. Anyway, execution only takes 0.75 milliseconds. Added that to my original question as well. – Richard Gieg Mar 31 '21 at 20:58
Is it possible to utilize this approach to also retrieve the min and max trade_id as well as min and max time for each of the unique product ids? Or is this approach mainly geared toward getting the distinct values? – Richard Gieg Mar 31 '21 at 21:05
1

@RichardGieg: All possible. Getting min *and* max complicates matters, but still possible. To keep it simple, you could run multiple very fast queries. Once you have the distinct list of product_ids, you can reuse that to make additional queries simpler and faster, yet. Detailed guide in one of the links I added: https://stackoverflow.com/questions/25536422/optimize-group-by-query-to-retrieve-latest-row-per-user/25536748#25536748 Ask another question if you are in over your head. You can drop a comment here to link forward ... – Erwin Brandstetter Mar 31 '21 at 21:11
My new question: https://stackoverflow.com/questions/66895595/how-to-use-index-skip-emulation-in-postgresql-to-retrieve-distinct-product-ids-a – Richard Gieg Mar 31 '21 at 21:30

SELECT DISTINCT is slower than expected on my table in PostgreSQL

1 Answers1

Linked