Optimize updating first, last, and second to last ranked value

Question

I need to cache the first, last, and second to last time a thing happened per user. The history table I'm querying has hundreds of millions of rows (we're caching so we can truncate it), and the table I'm updating has dozens of millions.

Currently I'm doing it in batches of 1000 to avoid locking the tables. The query is like so:

with ranked as (
  select
      user_id,
      rank() over (partition by user_id order by created_at desc) as ranked_desc,
      rank() over (partition by user_id order by created_at asc) as ranked_asc,
      created_at
  from history
  where type = 'SomeType' and
        user_id between $1 and $2
)
update
  users u
set
  latest_at = (
    select created_at
    from ranked
    where ranked.ranked_desc = 1 and ranked.user_id = u.id
  ),
  previous_at = (
    select created_at
    from ranked
    where ranked.ranked_desc = 2 and ranked.user_id = u.id
  ),
  first_at = (
    select created_at
    from ranked
    where ranked.ranked_asc = 1 and ranked.user_id = u.id
  )
from ranked
where u.id = ranked.user_id

Relevant indexes on history are these. They are all btree indexes.

(created_at)
(user_id, created_at)
(user_id, type)
(type, created_at)

Can this be optimized? I feel this can be done without the subqueries.

Any particular reason to use `rank()` instead of `row_number()`? Will raise an exception for any duplicate in the used ranks. Also: please always declare your version of Postgres and provide a basic table definition with column types and constraints (`CREATE TABLE ...` statement). — Erwin Brandstetter, Sep 26 '21 at 00:49
@ErwinBrandstetter Postgres 13, as tagged. I'm on mobile at the moment, but assume column types are sane (timestamps, properly declared foreign keys, etc). Thanks for the advice about rank, it shouldn't come up, but I'll switch it out. — Schwern, Sep 26 '21 at 00:56
It would help to know if `(user_id, created_at)` is `UNIQUE`. (Your use of `rank()` indicates as much!) Or can there be multiple entries with the same timestamp? — Erwin Brandstetter, Sep 26 '21 at 01:09
@ErwinBrandstetter It is not a unique index, but you can assume it is effectively unique. — Schwern, Sep 26 '21 at 01:19
Have you considered using a trigger to update the users table every time a new row is entered in the history table? — Blue Star, Sep 26 '21 at 01:36
@BlueStar We effectively have that now, but we just added it. This is to backfill the users columns so we can truncate the enormous history table. — Schwern, Sep 26 '21 at 01:40
Please consider [instructions for \[posgresql-performance\] questions](https://stackoverflow.com/tags/postgresql-performance/info). — Erwin Brandstetter, Sep 26 '21 at 02:11

Erwin Brandstetter · Accepted Answer · 2021-09-26T02:13:38.090

Since we have the all-important index on (user_id, created_at), I suggest:

UPDATE users u
SET    first_at    = h.first_at
     , latest_at   = h.latest_at
     , previous_at = h.previous_at
FROM  (
   SELECT u.id, f.first_at, l.last[1] AS latest_at, l.last[2] AS previous_at
   FROM   users u
   CROSS  JOIN LATERAL (
      SELECT ARRAY (
         SELECT h.created_at
         FROM   history h
         WHERE  h.user_id = u.id
         AND    h.type = 'SomeType'  -- ??
         ORDER  BY h.created_at DESC
         LIMIT  2
         ) AS last
      ) l
   CROSS  JOIN LATERAL (
      SELECT created_at AS first_at
      FROM   history h
      WHERE  h.user_id = u.id
      AND    h.type = 'SomeType'  -- ??
      ORDER  BY created_at
      LIMIT  1
      ) f
   WHERE  u.id BETWEEN $1 AND $2
   ) h
WHERE  u.id = h.id
AND   (u.first_at    IS DISTINCT FROM h.first_at
    OR u.latest_at   IS DISTINCT FROM h.latest_at
    OR u.previous_at IS DISTINCT FROM h.previous_at);

This works with non-unique timestamps per user_id, too.

And it's very efficient if there are many rows per user. It's designed to avoid a sequential scan on the big table and make heavy use of the index on (user_id, created_at) instead. Related:

Optimize GROUP BY query to retrieve latest row per user

Assuming most or all users get updated this way, we don't need an index on users. (For the purpose of this UPDATE, no index would be best.)

If there is only a single row in table history for a user, then previous_at is set to NULL. (Your original query has the same effect.)

Only users are updated where qualifying history rows are found.

This added WHERE clause skips updates that would not change anything (at full cost):

AND   (u.first_at    IS DISTINCT FROM h.first_at
    OR u.latest_at   IS DISTINCT FROM h.latest_at
    OR u.previous_at IS DISTINCT FROM h.previous_at)

See:

How do I (or can I) SELECT DISTINCT on multiple columns?

The only insecurity is with WHERE type = 'SomeType'. If that's selective, a partial index with the same predicate would be better. Then we could even get index-only scans ...

Since the new query should be much faster, you might update more (or all) users at once.

Thanks, I'll give it a shot tomorrow. What's the purpose of the distinct from clauses? Is there any risk of locking the tables if I do it all at once? — Schwern, Sep 26 '21 at 02:06
`distinct from clauses`? You mean the two lateral subqueries? The `history` table is only read and not locked at all. Updated rows of `users` are locked till the end of the transaction. — Erwin Brandstetter, Sep 26 '21 at 02:09
The `u.first_at IS DISTINCT FROM h.first_at` clauses. What is their purpose? — Schwern, Sep 26 '21 at 02:11
I clarified explanation above. More details in the linked answer: https://stackoverflow.com/a/12632129/939860. Also, to be clear: Updated rows of `users` are write-locked till the end of the transaction. (Readers are not blocked.) — Erwin Brandstetter, Sep 26 '21 at 02:14
This is significantly faster, amazing! This backfill should be done before Monday morning now. Thank you, I appreciate your help. I'm gonna go rewrite the other slow backfill queries. — Schwern, Sep 26 '21 at 19:20

Optimize updating first, last, and second to last ranked value

1 Answers1