Effectively select latest row for each group in a very large table?

Question

I have (for example's sake) a table Users (user_id, status, timestamp, ...).
I also have another table SpecialUsers (user_id, ...).

I need to show each special user's latest status.

The problem is that the Users table is VERY, VERY LARGE (more than 50 Billion rows). Most of the solutions in for instance this question just hang or get "disk full" error.

SpecialUsers table is much smaller - "only" 600K rows.

SELECT DISTINCT ON() is not supported. Working on Amazon RedShift.

EDIT: per request to see the failed attempts - one of those resulting in the disk full error is like this:

with users_with_status (user_id, status, timestamp)
as (
        select su.user_id, u.instance_type, u.timestamp
        from specialusers su
        join users u on su.user_id = u.user_id
)
select l.instance_id, l.instance_type
from users_with_status l
left outer join users_with_status r
     on l.user_id = r.user_id and l.timestamp < r.timestamp
where r.timestamp is null;

I know that I'm joining a bug table with itself but was hoping that the first join with small table would reduce the number of processed rows.

Anyway, seems that window functions is the solution here.

Can you post your SQL(s)? Many cases where disks fill up it is due to under qualified joins or inequality join on clauses. These cause an explosion of rows in the intermediate results which is what is filling up the disks. — Bill Weiner, Jul 12 '21 at 15:21
It is likely that the disks filling up is due to the ON clause "l.timestamp < r.timestamp". This creates a data image of the possible join combinations which could be massive. The WHERE clause cannot be applied until all this joining is complete. — Bill Weiner, Jul 15 '21 at 03:12

score 3 · Accepted Answer · answered Jul 12 '21 at 14:31

3

Perhaps a join with a window function will work:

select su.*
from (select s.user_id, u.status, u.timestamp,
             max(u.timestamp) over (partition by s.user_id) as max_timestamp
      from specialusers s join
           users u
           on s.user_id = u.user_id
     ) su
where timestamp = max_timestamp;

This specifically uses max() instead of row_number() on the speculation that it might use slightly fewer resources.

answered Jul 12 '21 at 14:31

Gordon Linoff

1,242,037
58
646
786

thanks, this nailed it basically. However I discovered that I have several rows in the big table with the same timestamp up to a microsecond :) so going to try row_number(). I presume I can also use `group by`. – davka Jul 14 '21 at 12:50
and it worked. Didn't notice a change in the execution time. Thanks! – davka Jul 14 '21 at 15:03

Effectively select latest row for each group in a very large table?

1 Answers1