most efficient way to select duplicate rows with max timestamp

Question

Suppose I have a table called t, which is like

id  content  time
1     'a'     100
1     'a'     101
1     'b'     102
2     'c'     200
2     'c'     201

id are duplicate, and for the same id, content could also be duplicate. Now I want to select for each id the rows with max timestamp, which would be

id  content  time
1      'b'    102
2      'c'    201

And this is my current solution:

select t1.id, t1.content, t1.time 
from (
  select id, content, time from t 
) as t1 
right join (
  select id, max(time) as time from t group by id
) as t2 
on t1.id = t2.id and t1.time = t2.time;

But this looks inefficient to me. Because theoretically when select id, max(time) as time from t group by id is executed, the rows I want have already been located. The right join brings extra O(n^2) time cost, which seems unnecessary.

So is there any more efficient way to do it, or anything that I missunderstand?

https://stackoverflow.com/questions/3800551/select-first-row-in-each-group-by-group/7630564#7630564 — jian, Mar 17 '22 at 06:01

Tim Biegeleisen · Accepted Answer · 2022-03-18T02:55:05.457

1

Use DISTINCT ON:

SELECT DISTINCT ON (id) id, content, time
FROM yourTable
ORDER BY id, time DESC;

On Postgres, this is usually the most performant way to write your query, and it should outperform ROW_NUMBER and other approaches.

The following index might speed up this query:

CREATE INDEX idx ON yourTable (id, time DESC, content);

This index, if used, would let Postgres rapidly find, for each id, the record having the latest time. This index also covers the content column.

edited Mar 18 '22 at 02:55

answered Mar 17 '22 at 04:07

Tim Biegeleisen

502,043
27
286
360

is `order by` necessary, considering I just want a max? – taotsi Mar 17 '22 at 09:05
Yes `ORDER BY` is very necessary to make `DISTINCT ON` work. If you need some _other_ ordering, then you may subquery and add a different order by clause. – Tim Biegeleisen Mar 17 '22 at 09:07
@taotsi without the ORDER BY, how would it know you want the max? It needs to informed of that somehow. With DISTINCT ON, ordering is the mechanism which informs it. – jjanes Mar 17 '22 at 15:52
@TimBiegeleisen I mean, generally, sorting takes O(nlogn) time, while finding the max takes only O(n). I just want the max, and don't care the second max or the third max... If you're doing this using some dataframe library like pandas, you can find the max without sorting. I wonder if SQL can do that – taotsi Mar 18 '22 at 02:41
@taotsi I think what is you're looking for then is an _index_ on your table which would allow for log(n) search time, for each `id` value, q.v. my updated answer above. – Tim Biegeleisen Mar 18 '22 at 02:55

score 0 · Answer 2 · answered Mar 17 '22 at 04:53

0

Try this

SELECT a.id, a.content, a.time FROM t AS a
INNER JOIN (
    SELECT a.content, MAX(a.time) AS time FROM t
    GROUP BY a.content
) AS b ON a.content = b.content AND a.time = b.time

answered Mar 17 '22 at 04:53

Nayanish Damania

542
5
13

`content` could be very long strings. I think it's less efficient to compare `content` than `id` – taotsi Mar 17 '22 at 09:07

most efficient way to select duplicate rows with max timestamp

2 Answers2