2

I have a table "UserData" with the following information in it:

User    Date    DateTime        Input
1   8/4/2019    8/4/2019 0:55   Request
1   8/4/2019    8/4/2019 0:56   Ticket
1   8/4/2019    8/4/2019 2:08   Submit
1   8/4/2019    8/4/2019 2:21   Submit
2   8/4/2019    8/4/2019 13:10  Submit
2   8/20/2019   8/20/2019 2:10  Ticket
2   8/20/2019   8/20/2019 2:12  Ticket
2   8/20/2019   8/20/2019 2:13  Request
3   8/20/2019   8/20/2019 2:15  Request
3   8/19/2019   8/19/2019 2:16  Ticket
3   6/12/2020   6/12/2020 2:22  Submit
3   6/12/2020   6/12/2020 2:26  Submit
3   6/12/2020   6/12/2020 3:26  Ticket

I'm looking to return the first input from each user (based on time) but only on the newest day. So my output would be something like this:

User    DateTime    Input
1   8/4/2019 0:55   Request
2   8/20/2019 2:10  Ticket
3   6/12/2020 2:22  Submit

I'm thinking that I need to assign a rank to each date but not really sure where to start on this one.

GMB
  • 216,147
  • 25
  • 84
  • 135
Ryan
  • 31
  • 3
  • Your Postgres version and basic data distribution are essential to identify the best solution. How many rows total and how many distinct users, roughly? Also, is there a separate `users` table with one row per relevant user? – Erwin Brandstetter Mar 10 '20 at 23:42

4 Answers4

2

I understand that you want the earliest record per user and per day. In Postgres, you can simply use distinct on to solve this top-1-per-group problem:

select distinct on (u.user, u.date) u.*
from userData u
order by u.user, u.date, u.datetime

If you want the earliest record per user, regardless of the day, then just:

select distinct on (u.user) u.*
from userData u
order by u.user, u.datetime

Edit: if you want the earliest record on the latest day, then:

select distinct on (u.user) u.*
from userData u
order by u.user, u.date desc, u.datetime
GMB
  • 216,147
  • 25
  • 84
  • 135
  • "Only on the newest day" was what they asked. (Not my downvote) – Vesa Karjalainen Mar 10 '20 at 21:56
  • This construct is awesome. I'd never come across it. However, the output he desired is "Oldest record on the newest day". So the sort would be `u.user, u.date DESC, u.datetime` – Chris Hep Mar 10 '20 at 21:56
  • 1
    @ChrisHep: ah yes I think that you are right. That's just a slight change in the query, I edited my answer. Thank you. – GMB Mar 10 '20 at 21:59
0

With row_number() window function:

select t."User", t."Date", t."DateTime", t."Input"
from (
  select *, row_number() over (partition by "User" order by "Date" desc, "DateTime") rn
  from UserData
) t
where t.rn = 1

For every user the row are sorted by "Date" descending to find the latest date and then by "DateTime" ascending to get the 1st input of that day.
See the demo.
Results:

| User | Date       | DateTime         | Input   |
| ---- | ---------- | ---------------- | ------- |
| 1    | 2019-08-04 | 2019-08-04 00:55 | Request |
| 2    | 2019-08-20 | 2019-08-20 02:10 | Ticket  |
| 3    | 2020-06-12 | 2020-06-12 02:22 | Submit  |
forpas
  • 160,666
  • 10
  • 38
  • 76
0
SELECT * 
FROM ( User,
       Date,
       Datetime, 
       Input,
       ROW_NUMBER() OVER (
           PARTITION BY User 
           ORDER BY Datetime DESC) dataOrder
       FROM UserData) z
WHERE z.dataOrder = 1
belwood
  • 3,320
  • 11
  • 38
  • 45
  • While this code may answer the question, providing additional context regarding why and/or how this code answers the question improves its long-term value. – Igor F. Mar 11 '20 at 09:18
0

I understand you want:

the row with the earliest time of the latest day for every user

Table design

For starters: drop the Date column. Storing it redundantly adds more cost and complications than it's worth:

CREATE TABLE userdata (
  user_id  int
, datetime timestamp
, input    text
);

input should really be some cheap enumeration implementation (enum, FK, ...).
timestamptz might be the appropriate type for datetime. Depends. See:

Index

Either way, to make your operation fast, this is the perfect index:

CREATE INDEX userdata_special_idx ON userdata
(user_id, (datetime::date) DESC NULLS LAST, datetime);

datetime::date is a very cheap cast, replacing your redundant date column. I still add the date to the multicolumn expression index for performance. (The date depends on the time zone when working with timestamptz. If you deal with multiple time zones you need to do more.)

Note the added NULLS LAST: since nothing in your question says that the the timestamp is NOT NULL, you need this in the query to prevent nonsense results - and the index must match for best results. See:

Query

For only few rows per user, DISTINCT ON should be the best choice (like GMB already suggested) - simple and fast:

SELECT DISTINCT ON (user_id)
       user_id, datetime, input 
FROM   userdata
ORDER  BY user_id, datetime::date DESC NULLS LAST, datetime;

See:

For many rows per user, this alternative query should be (substantially) faster:

SELECT u.user_id, d.*
FROM   users u
LEFT   JOIN LATERAL (
   SELECT d.datetime, d.input 
   FROM   userdata d
   WHERE  d.user_id = u.user_id         -- lateral reference
   ORDER  BY d.datetime::date DESC NULLS LAST, d.datetime
   LIMIT  1
   ) d ON true;

Typically, this is the way to go for your scenario.

Note the LEFT JOIN: it returns a row for every user, even with no entries in userdata. If that's not desirable, use CROSS JOIN instead. Related:

This assumes the existence of a users table, which typically exists. If it doesn't, I suggest you add it (for multiple reasons). If that's not an option, there are still fast workarounds. See:

db<>fiddle here

Aside: I warmly recommend to always use ISO date format (as does the manual). Regional format depends on settings of the current session and may fail in ugly ways.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228