I have a PostgreSQL table with a bunch of events, each containing a timestamp, id, and action. I expect a few million events per day, and expect the table to grow to several billion rows eventually.

I want to query the most recent events for each item (a few hundred thousand distinct ids) before a given date, but currently the query is extremely slow, taking 1-2 hours minimum (the table currently has around 100 million rows). Is there a way to speed up this query?

SELECT a.* FROM (
    WITH events AS (
        SELECT
          ROW_NUMBER() OVER (PARTITION BY item ORDER BY time_stamp DESC) AS rn,
          *
        FROM event_updates
        WHERE time_stamp < '2023-05-01'
    )
    SELECT * FROM events WHERE rn = 1 ORDER BY item
) a;
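For diagnosis, the plan can be inspected with `EXPLAIN (ANALYZE, BUFFERS)`. Without an index covering `(item, time_stamp)`, this shape of query forces a sequential scan over all qualifying rows followed by a sort and window aggregate, which is consistent with the multi-hour runtime at ~100 million rows (a sketch, using the table and columns from the question):

```sql
-- Sketch: without a supporting index, expect Seq Scan -> Sort -> WindowAgg
-- in the plan output, which explains the hours-long runtime.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM  (SELECT ROW_NUMBER() OVER (PARTITION BY item
                                 ORDER BY time_stamp DESC) AS rn,
              *
       FROM   event_updates
       WHERE  time_stamp < '2023-05-01') e
WHERE  e.rn = 1;
```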

DDL for the table:

CREATE TABLE "event_updates" (
  "id" int4 NOT NULL DEFAULT nextval("event_updates"::regclass),
  "time_stamp" timestamptz(6) NOT NULL DEFAULT CURRENT_TIMESTAMP,
  "item" varchar(32) COLLATE "pg_catalog"."default" NOT NULL,
  "event_type" int2,
)
PARTITION BY ();
ALTER TABLE "event_updates" OWNER TO "owners";
John James
    Could you please share the DDL for the table and indexes involved? (in plain text) Your current query most likely doesn't take the most recent data, just data older than May 2023. And it takes all data older than May 2023 and that could be years of data. Depending on when you started collecting data. – Frank Heikens Aug 10 '23 at 22:01
  • 1
    See https://stackoverflow.com/questions/tagged/greatest-n-per-group+postgresql?tab=Votes – Bergi Aug 10 '23 at 22:23
  • 1
    @FrankHeikens I've just added in the DDL for the table. I think the query does get the most recent data older than May 2023. I'm selecting all rows older than May 2023, and then taking the first row of that partitioned by the item when sorted by the timestamp. – John James Aug 10 '23 at 22:27
  • 2
    What's `PARTITION BY ()`? And are there no indices on this table? Not even a `UNIQUE` constraint? Then of course it's slow. – Bergi Aug 10 '23 at 22:30
  • 1
    No indexes? That makes it slow by design. – Frank Heikens Aug 10 '23 at 22:30
  • 1
    Please provide your *actual*, working table definition (incl. all constraints), all indexes and other information as instructed here: https://stackoverflow.com/tags/postgresql-performance/info Is there an additional table with unique IDs? (important!) – Erwin Brandstetter Aug 10 '23 at 22:41
  • 1
    I assume *"most recent events for each id"* is supposed to mean *"most recent events for each `item`"* – Erwin Brandstetter Aug 10 '23 at 22:56

2 Answers

First of all, you need an index on `item` and the timestamp. Second, select the `MAX(time_stamp)` per item (within your date range) and join back to fetch the full record:

CREATE INDEX idx_event_updates_item_time_stamp
    ON event_updates(item, time_stamp);

SELECT *
FROM event_updates
    JOIN (SELECT item
               , MAX(time_stamp) AS time_stamp
          FROM event_updates
          WHERE time_stamp < '2023-05-01'
          GROUP BY item
          ) sub USING (item, time_stamp);

You might want to create the index per partition, using `CREATE INDEX CONCURRENTLY`, to avoid locking issues; it takes longer, however.
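For a partitioned table, `CREATE INDEX CONCURRENTLY` cannot be run on the partitioned parent directly. One approach (a sketch; the partition name `event_updates_p2023` is a placeholder, since the actual partitioning scheme was not disclosed) is to create an invalid index on the parent, build the partition indexes concurrently, then attach them:

```sql
-- Create the parent index without recursing into partitions
-- (it starts out INVALID until all partition indexes are attached).
CREATE INDEX idx_event_updates_item_time_stamp
    ON ONLY event_updates (item, time_stamp);

-- Build each partition's index without blocking writes;
-- repeat for every partition (name here is hypothetical).
CREATE INDEX CONCURRENTLY idx_eu_p2023_item_ts
    ON event_updates_p2023 (item, time_stamp);

-- Attach it to the parent; once all partitions are attached,
-- the parent index becomes valid automatically.
ALTER INDEX idx_event_updates_item_time_stamp
    ATTACH PARTITION idx_eu_p2023_item_ts;
```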

Frank Heikens
Emulate an index skip scan for best performance. This should be the fastest way:

WITH RECURSIVE cte AS (
   (
   SELECT *
   FROM   event_updates
   ORDER  BY item, time_stamp DESC
   LIMIT  1
   )
   UNION ALL
   SELECT e.*
   FROM   cte c
   CROSS  JOIN LATERAL (
      SELECT e.*
      FROM   event_updates e
      WHERE  e.item > c.item
      ORDER  BY item, time_stamp DESC
      LIMIT  1
      ) e
   )
TABLE  cte;

In combination with a matching index (essential):

CREATE INDEX event_updates_item_time_stamp_idx ON event_updates(item, time_stamp DESC);
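Since the question asks for the most recent events *before a given date*, the same pattern works with the cutoff applied in both branches of the recursive CTE; the index above still supports each `LIMIT 1` lookup (a sketch, reusing the cutoff date from the question):

```sql
-- Sketch: skip scan restricted to events before the cutoff date.
-- For each item, the index scan descends from the cutoff to the
-- latest qualifying row, then jumps to the next item.
WITH RECURSIVE cte AS (
   (
   SELECT *
   FROM   event_updates
   WHERE  time_stamp < '2023-05-01'
   ORDER  BY item, time_stamp DESC
   LIMIT  1
   )
   UNION ALL
   SELECT e.*
   FROM   cte c
   CROSS  JOIN LATERAL (
      SELECT e.*
      FROM   event_updates e
      WHERE  e.item > c.item
      AND    e.time_stamp < '2023-05-01'
      ORDER  BY item, time_stamp DESC
      LIMIT  1
      ) e
   )
TABLE  cte;
```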

More optimization may be possible. I'll explain more after you have disclosed the requested information.

Erwin Brandstetter