Determining first activity for the day per user

Question

I have a table "UserData" with the following information in it:

User    Date    DateTime        Input
1   8/4/2019    8/4/2019 0:55   Request
1   8/4/2019    8/4/2019 0:56   Ticket
1   8/4/2019    8/4/2019 2:08   Submit
1   8/4/2019    8/4/2019 2:21   Submit
2   8/4/2019    8/4/2019 13:10  Submit
2   8/20/2019   8/20/2019 2:10  Ticket
2   8/20/2019   8/20/2019 2:12  Ticket
2   8/20/2019   8/20/2019 2:13  Request
3   8/20/2019   8/20/2019 2:15  Request
3   8/19/2019   8/19/2019 2:16  Ticket
3   6/12/2020   6/12/2020 2:22  Submit
3   6/12/2020   6/12/2020 2:26  Submit
3   6/12/2020   6/12/2020 3:26  Ticket

I'm looking to return the first input from each user (based on time) but only on the newest day. So my output would be something like this:

User    DateTime    Input
1   8/4/2019 0:55   Request
2   8/20/2019 2:10  Ticket
3   6/12/2020 2:22  Submit

I'm thinking that I need to assign a rank to each date but not really sure where to start on this one.

Your Postgres version and basic data distribution are essential to identify the best solution. How many rows total and how many distinct users, roughly? Also, is there a separate `users` table with one row per relevant user? — Erwin Brandstetter, Mar 10 '20 at 23:42

GMB · Answer 1 · 2020-03-10T21:58:06.137

2

I understand that you want the earliest record per user and per day. In Postgres, you can simply use distinct on to solve this top-1-per-group problem:

select distinct on (u.user, u.date) u.*
from userData u
order by u.user, u.date, u.datetime

If you want the earliest record per user, regardless of the day, then just:

select distinct on (u.user) u.*
from userData u
order by u.user, u.datetime

Edit: if you want the earliest record on the latest day, then:

select distinct on (u.user) u.*
from userData u
order by u.user, u.date desc, u.datetime

edited Mar 10 '20 at 21:58

answered Mar 10 '20 at 21:37

GMB

216,147
25
84
135

"Only on the newest day" was what they asked. (Not my downvote) – Vesa Karjalainen Mar 10 '20 at 21:56
This construct is awesome. I'd never come across it. However, the output he desired is "Oldest record on the newest day". So the sort would be `u.user, u.date DESC, u.datetime` – Chris Hep Mar 10 '20 at 21:56
1

@ChrisHep: ah yes I think that you are right. That's just a slight change in the query, I edited my answer. Thank you. – GMB Mar 10 '20 at 21:59

forpas · Accepted Answer · 2020-03-10T21:39:04.183

With row_number() window function:

select t."User", t."Date", t."DateTime", t."Input"
from (
  select *, row_number() over (partition by "User" order by "Date" desc, "DateTime") rn
  from UserData
) t
where t.rn = 1

For every user the row are sorted by "Date" descending to find the latest date and then by "DateTime" ascending to get the 1st input of that day.
See the demo.
Results:

| User | Date       | DateTime         | Input   |
| ---- | ---------- | ---------------- | ------- |
| 1    | 2019-08-04 | 2019-08-04 00:55 | Request |
| 2    | 2019-08-20 | 2019-08-20 02:10 | Ticket  |
| 3    | 2020-06-12 | 2020-06-12 02:22 | Submit  |

score 0 · Answer 3 · edited Mar 11 '20 at 00:16

0

SELECT * 
FROM ( User,
       Date,
       Datetime, 
       Input,
       ROW_NUMBER() OVER (
           PARTITION BY User 
           ORDER BY Datetime DESC) dataOrder
       FROM UserData) z
WHERE z.dataOrder = 1

edited Mar 11 '20 at 00:16

belwood

3,320
11
38
45

answered Mar 10 '20 at 21:57

Alvaro Vasquez Jr.

1

While this code may answer the question, providing additional context regarding why and/or how this code answers the question improves its long-term value. – Igor F. Mar 11 '20 at 09:18

Erwin Brandstetter · Answer 4 · 2020-03-11T00:37:13.290

I understand you want:

the row with the earliest time of the latest day for every user

Table design

For starters: drop the Date column. Storing it redundantly adds more cost and complications than it's worth:

CREATE TABLE userdata (
  user_id  int
, datetime timestamp
, input    text
);

input should really be some cheap enumeration implementation (enum, FK, ...).
timestamptz might be the appropriate type for datetime. Depends. See:

Ignoring time zones altogether in Rails and PostgreSQL

Index

Either way, to make your operation fast, this is the perfect index:

CREATE INDEX userdata_special_idx ON userdata
(user_id, (datetime::date) DESC NULLS LAST, datetime);

datetime::date is a very cheap cast, replacing your redundant date column. I still add the date to the multicolumn expression index for performance. (The date depends on the time zone when working with timestamptz. If you deal with multiple time zones you need to do more.)

Note the added NULLS LAST: since nothing in your question says that the the timestamp is NOT NULL, you need this in the query to prevent nonsense results - and the index must match for best results. See:

PostgreSQL sort by datetime asc, null first?

Query

For only few rows per user, DISTINCT ON should be the best choice (like GMB already suggested) - simple and fast:

SELECT DISTINCT ON (user_id)
       user_id, datetime, input 
FROM   userdata
ORDER  BY user_id, datetime::date DESC NULLS LAST, datetime;

See:

Select first row in each GROUP BY group?

For many rows per user, this alternative query should be (substantially) faster:

SELECT u.user_id, d.*
FROM   users u
LEFT   JOIN LATERAL (
   SELECT d.datetime, d.input 
   FROM   userdata d
   WHERE  d.user_id = u.user_id         -- lateral reference
   ORDER  BY d.datetime::date DESC NULLS LAST, d.datetime
   LIMIT  1
   ) d ON true;

Typically, this is the way to go for your scenario.

Note the LEFT JOIN: it returns a row for every user, even with no entries in userdata. If that's not desirable, use CROSS JOIN instead. Related:

What is the difference between LATERAL and a subquery in PostgreSQL?

This assumes the existence of a users table, which typically exists. If it doesn't, I suggest you add it (for multiple reasons). If that's not an option, there are still fast workarounds. See:

Optimize GROUP BY query to retrieve latest row per user

db<>fiddle here

Aside: I warmly recommend to always use ISO date format (as does the manual). Regional format depends on settings of the current session and may fail in ugly ways.

How to create query with format date DD-MM-YYYY

Determining first activity for the day per user

4 Answers4

Table design

Index

Query