Index skip scan emulation to retrieve distinct product IDs and min/max for additional columns

Question

Here's my table schema:

CREATE TABLE tickers (
    product_id TEXT NOT NULL,
    trade_id INT NOT NULL,
    sequence BIGINT NOT NULL,
    time TIMESTAMPTZ NOT NULL,
    price NUMERIC NOT NULL,
    side TEXT NOT NULL,
    last_size NUMERIC NOT NULL,
    best_bid NUMERIC NOT NULL,
    best_ask NUMERIC NOT NULL,
    PRIMARY KEY (product_id, trade_id)
);

CREATE INDEX idx_tickers_product_id_time ON tickers (product_id, time);

My application subscribes to Coinbase Pro's websocket on the "ticker" channel and inserts a row into the tickers table whenever it receives a message.

The table has over two million rows now.

I learned how to use index skip scan emulation (see: SELECT DISTINCT is slower than expected on my table in PostgreSQL) in PostgreSQL in order to quickly retrieve distinct product_id values from this table, rather than using the slower SELECT DISTINCT method.

I also want to retrieve min/max values for other columns. Here's what I came up with. It takes ~2.9 milliseconds over 2.25 million rows.

Is there a better way to accomplish this?

WITH product_ids AS (
    WITH RECURSIVE cte AS (
       (   -- parentheses required
           SELECT product_id
           FROM tickers
           ORDER BY 1
           LIMIT 1
       )
       UNION ALL
       SELECT l.*
       FROM cte c
       CROSS JOIN LATERAL (
          SELECT product_id
          FROM tickers t
          WHERE t.product_id > c.product_id  -- lateral reference
          ORDER BY 1
          LIMIT 1
          ) l
       )
    TABLE cte
)
SELECT
    product_id,
    (SELECT (MAX(trade_id) - MIN(trade_id) + 1) FROM tickers WHERE product_id = product_ids.product_id) AS ticker_count,
    (SELECT MIN(time) FROM tickers WHERE product_id = product_ids.product_id) AS min_time,
    (SELECT MAX(time) FROM tickers WHERE product_id = product_ids.product_id) AS max_time
FROM product_ids
ORDER BY ticker_count DESC

There is a combined index on product_id and time. Updated schema in question. – Richard Gieg, Mar 31 '21 at 21:33
`(MAX(trade_id) - MIN(trade_id)`? Subtracting IDs? That's not a typo? – Erwin Brandstetter, Mar 31 '21 at 21:37
That's correct. The trade_id for each product always goes up by 1 as new tickers are broadcast by Coinbase Pro's websocket "ticker" channel. Earlier iterations of this query used `COUNT`, but I came up with this as an optimization because `COUNT` became very slow once my table started filling up with millions of rows. But if we can avoid doing this subtraction I think it would be better. – Richard Gieg, Mar 31 '21 at 21:40
@ErwinBrandstetter: Oops... it was supposed to be `(MAX(trade_id) - MIN(trade_id) + 1)`, not `(MAX(trade_id) - MIN(trade_id))`. I edited my question. – Richard Gieg, Mar 31 '21 at 21:51
Writing this as a comment because I get super annoyed when people say "why don't you redesign your schema?". But have you considered having a separate version table? One that's a simple mapping between the product_id and the trade_id? You could join that onto the main table. – Rol, Apr 07 '22 at 13:31

Erwin Brandstetter · Accepted Answer · 2021-03-31T22:20:10.233

Query

Using the existing index on (product_id, time) we can get two for the price of one, i.e. fetch product_id and minimum time in one index scan:

WITH RECURSIVE product_ids AS (
   (   -- parentheses required
   SELECT product_id, time AS min_time
   FROM   tickers
   ORDER  BY 1, 2
   LIMIT  1
   )
   UNION ALL
   SELECT l.*
   FROM   product_ids p
   CROSS JOIN LATERAL (
      SELECT t.product_id, t.time
      FROM   tickers t
      WHERE  t.product_id > p.product_id
      ORDER  BY 1, 2
      LIMIT  1
      ) l
   )
SELECT product_id, min_time
    , (SELECT MAX(time) FROM tickers WHERE product_id = p.product_id) AS max_time
    , (SELECT MAX(trade_id) - MIN(trade_id) + 1 FROM tickers WHERE product_id = p.product_id) AS ticker_count
FROM   product_ids p
ORDER  BY ticker_count DESC;

Also, no need for a second CTE wrapper.
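
To see the recursion step by step, here is a minimal sketch on a scratch table. The table name tickers_demo and the sample rows are made up for illustration; only the query shape is taken from above. The first branch picks the smallest (product_id, time) pair, and each recursive step jumps to the first row of the next product_id, so every step is a single index probe.

```sql
-- Hypothetical scratch table; name and values are for illustration only.
CREATE TEMP TABLE tickers_demo (
    product_id text        NOT NULL,
    trade_id   int         NOT NULL,
    time       timestamptz NOT NULL,
    PRIMARY KEY (product_id, trade_id)
);
CREATE INDEX ON tickers_demo (product_id, time);

INSERT INTO tickers_demo (product_id, trade_id, time) VALUES
    ('BTC-USD', 1, '2021-03-31 21:00:00+00'),
    ('BTC-USD', 2, '2021-03-31 21:05:00+00'),
    ('BTC-USD', 3, '2021-03-31 21:10:00+00'),
    ('ETH-USD', 7, '2021-03-31 21:02:00+00'),
    ('ETH-USD', 8, '2021-03-31 21:04:00+00');

WITH RECURSIVE product_ids AS (
   (   -- anchor: first row in (product_id, time) order
   SELECT product_id, time AS min_time
   FROM   tickers_demo
   ORDER  BY 1, 2
   LIMIT  1
   )
   UNION ALL
   SELECT l.*
   FROM   product_ids p
   CROSS JOIN LATERAL (   -- step: first row of the next product_id
      SELECT t.product_id, t.time
      FROM   tickers_demo t
      WHERE  t.product_id > p.product_id
      ORDER  BY 1, 2
      LIMIT  1
      ) l
   )
SELECT product_id, min_time
FROM   product_ids
ORDER  BY product_id;
-- Two rows: BTC-USD with 21:00 UTC and ETH-USD with 21:02 UTC
-- (how the timestamps are displayed depends on the session's TimeZone setting).
```

The recursion stops once no larger product_id exists, because the LATERAL subquery then returns no row.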

Indexes

Currently you have two indexes: the PK index on (product_id, trade_id), and another one on (product_id, time). You might optimize this by reversing the column order in one of the two. Like:

PRIMARY KEY (trade_id, product_id)

Logically equivalent as a constraint, but the two indexes together then typically cover a wider range of possible queries. See (again):

Is a composite index also good for queries on the first field?

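The recursive CTE and both time lookups only need the existing index on (product_id, time), so they are not affected. The ticker_count subquery is the exception: it takes MIN and MAX of trade_id per product_id, which the current PK index on (product_id, trade_id) serves, so test that part before reversing the PK.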
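
To check how a reversed primary key affects the ticker_count lookup before committing to it, something along these lines could be used. It is a sketch under assumptions: the constraint is assumed to still have PostgreSQL's default name tickers_pkey (verify with \d tickers), and 'BTC-USD' is only a placeholder product_id. DDL is transactional in PostgreSQL, so the change can be rolled back after comparing plans.

```sql
-- Plan with the current PK (product_id, trade_id).
-- 'BTC-USD' is a placeholder value.
EXPLAIN (ANALYZE, BUFFERS)
SELECT MAX(trade_id) - MIN(trade_id) + 1
FROM   tickers
WHERE  product_id = 'BTC-USD';

BEGIN;

ALTER TABLE tickers
   DROP CONSTRAINT tickers_pkey      -- assumed default constraint name
 , ADD  PRIMARY KEY (trade_id, product_id);

-- Same lookup again, now with the PK on (trade_id, product_id).
EXPLAIN (ANALYZE, BUFFERS)
SELECT MAX(trade_id) - MIN(trade_id) + 1
FROM   tickers
WHERE  product_id = 'BTC-USD';

ROLLBACK;  -- use COMMIT instead to keep the new primary key
```

The ALTER TABLE rebuilds the index and holds an exclusive lock on the table until the transaction ends, so this is better tried on a copy or during a quiet period.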

edited Mar 31 '21 at 22:20

answered Mar 31 '21 at 21:47

Erwin Brandstetter


When I run this query it never returns. Same when I do `EXPLAIN ANALYZE` on it. – Richard Gieg Mar 31 '21 at 22:01
You tried the latest version with the fix for the `WHERE` clause? – Erwin Brandstetter Mar 31 '21 at 22:02
I guess I didn't, haha. Works great. Thanks!! – Richard Gieg Mar 31 '21 at 22:03
Execution time is now ~2.6 ms instead of 2.9 ms – Richard Gieg Mar 31 '21 at 22:06
Makes sense, only saved one index scan. Still. :) – Erwin Brandstetter Mar 31 '21 at 22:07
Every little bit helps. You've enlightened me today! Once I digest these new PostgreSQL concepts I'll have some more tools in the tool box. Thanks again :) – Richard Gieg Mar 31 '21 at 22:10
