I have a PostgreSQL table with a bunch of events, each containing a timestamp, id, and action. I expect a few million events per day, and expect the table to grow to several billion rows eventually.

I want to query the most recent events for each item (a few hundred thousand distinct ids) before a given date, but currently the query is extremely slow, taking 1-2 hours minimum (the table currently has around 100 million rows). Is there a way to speed up this query?

SELECT a.* FROM (
    WITH events AS (
        SELECT
          ROW_NUMBER() OVER (PARTITION BY item ORDER BY time_stamp DESC) AS rn,
          *
        FROM event_updates
        WHERE time_stamp < '2023-05-01'
    )
    SELECT * FROM events WHERE rn = 1 ORDER BY item
) a;
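For diagnosis, the plan can be inspected with `EXPLAIN (ANALYZE, BUFFERS)`. Without an index covering `(item, time_stamp)`, this shape of query forces a sequential scan over all qualifying rows followed by a sort and window aggregate, which is consistent with the multi-hour runtime at ~100 million rows (a sketch, using the table and columns from the question):

```sql
-- Sketch: without a supporting index, expect Seq Scan -> Sort -> WindowAgg
-- in the plan output, which explains the hours-long runtime.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM  (SELECT ROW_NUMBER() OVER (PARTITION BY item
                                 ORDER BY time_stamp DESC) AS rn,
              *
       FROM   event_updates
       WHERE  time_stamp < '2023-05-01') e
WHERE  e.rn = 1;
```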

DDL for the table:

CREATE TABLE "event_updates" (
  "id" int4 NOT NULL DEFAULT nextval("event_updates"::regclass),
  "time_stamp" timestamptz(6) NOT NULL DEFAULT CURRENT_TIMESTAMP,
  "item" varchar(32) COLLATE "pg_catalog"."default" NOT NULL,
  "event_type" int2,
)
PARTITION BY ();
ALTER TABLE "event_updates" OWNER TO "owners";
John James
    Could you please share the DDL for the table and indexes involved? (in plain text) Your current query most likely doesn't take the most recent data, just data older than May 2023. And it takes all data older than May 2023 and that could be years of data. Depending on when you started collecting data. – Frank Heikens Aug 10 '23 at 22:01
  • 1
    See https://stackoverflow.com/questions/tagged/greatest-n-per-group+postgresql?tab=Votes – Bergi Aug 10 '23 at 22:23
  • 1
    @FrankHeikens I've just added in the DDL for the table. I think the query does get the most recent data older than May 2023. I'm selecting all rows older than May 2023, and then taking the first row of that partitioned by the item when sorted by the timestamp. – John James Aug 10 '23 at 22:27
  • 2
    What's `PARTITION BY ()`? And are there no indices on this table? Not even a `UNIQUE` constraint? Then of course it's slow. – Bergi Aug 10 '23 at 22:30
  • 1
    No indexes? That makes it slow by design. – Frank Heikens Aug 10 '23 at 22:30
  • 1
    Please provide your *actual*, working table definition (incl. all constraints), all indexes and other information as instructed here: https://stackoverflow.com/tags/postgresql-performance/info Is there an additional table with unique IDs? (important!) – Erwin Brandstetter Aug 10 '23 at 22:41
  • 1
    I assume *"most recent events for each id"* is supposed to mean *"most recent events for each `item`"* – Erwin Brandstetter Aug 10 '23 at 22:56

2 Answers

First of all, you need an index on `item` and the timestamp. Second, select the `MAX(time_stamp)` per item (within your date range) and join back to fetch the full record:

CREATE INDEX idx_event_updates_item_time_stamp
    ON event_updates(item, time_stamp);

SELECT *
FROM event_updates
    JOIN (SELECT item
               , MAX(time_stamp) AS time_stamp
          FROM event_updates
          WHERE time_stamp < '2023-05-01'
          GROUP BY item
          ) sub USING (item, time_stamp);

You might want to create the index per partition, using `CREATE INDEX CONCURRENTLY`, to avoid locking issues; it takes longer, however.
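For a partitioned table, `CREATE INDEX CONCURRENTLY` cannot be run on the partitioned parent directly. One approach (a sketch; the partition name `event_updates_p2023` is a placeholder, since the actual partitioning scheme was not disclosed) is to create an invalid index on the parent, build the partition indexes concurrently, then attach them:

```sql
-- Create the parent index without recursing into partitions
-- (it starts out INVALID until all partition indexes are attached).
CREATE INDEX idx_event_updates_item_time_stamp
    ON ONLY event_updates (item, time_stamp);

-- Build each partition's index without blocking writes;
-- repeat for every partition (name here is hypothetical).
CREATE INDEX CONCURRENTLY idx_eu_p2023_item_ts
    ON event_updates_p2023 (item, time_stamp);

-- Attach it to the parent; once all partitions are attached,
-- the parent index becomes valid automatically.
ALTER INDEX idx_event_updates_item_time_stamp
    ATTACH PARTITION idx_eu_p2023_item_ts;
```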

Frank Heikens
Emulate an index skip scan for best performance. This should be the fastest way:

WITH RECURSIVE cte AS (
   (
   SELECT *
   FROM   event_updates
   ORDER  BY item, time_stamp DESC
   LIMIT  1
   )
   UNION ALL
   SELECT e.*
   FROM   cte c
   CROSS  JOIN LATERAL (
      SELECT e.*
      FROM   event_updates e
      WHERE  e.item > c.item
      ORDER  BY item, time_stamp DESC
      LIMIT  1
      ) e
   )
TABLE  cte;

In combination with a matching index (essential):

CREATE INDEX event_updates_item_time_stamp_idx ON event_updates(item, time_stamp DESC);
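Since the question asks for the most recent events *before a given date*, the same pattern works with the cutoff applied in both branches of the recursive CTE; the index above still supports each `LIMIT 1` lookup (a sketch, reusing the cutoff date from the question):

```sql
-- Sketch: skip scan restricted to events before the cutoff date.
-- For each item, the index scan descends from the cutoff to the
-- latest qualifying row, then jumps to the next item.
WITH RECURSIVE cte AS (
   (
   SELECT *
   FROM   event_updates
   WHERE  time_stamp < '2023-05-01'
   ORDER  BY item, time_stamp DESC
   LIMIT  1
   )
   UNION ALL
   SELECT e.*
   FROM   cte c
   CROSS  JOIN LATERAL (
      SELECT e.*
      FROM   event_updates e
      WHERE  e.item > c.item
      AND    e.time_stamp < '2023-05-01'
      ORDER  BY item, time_stamp DESC
      LIMIT  1
      ) e
   )
TABLE  cte;
```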

More optimization may be possible. I'll explain more after you have disclosed the requested information.

Erwin Brandstetter