Select row with most recent date per user

Question

I have a table ("lms_attendance") of users' check-in and out times that looks like this:

id  user    time    io (enum)
1   9   1370931202  out
2   9   1370931664  out
3   6   1370932128  out
4   12  1370932128  out
5   12  1370933037  in

I'm trying to create a view of this table that would output only the most recent record per user id, while giving me the "in" or "out" value, so something like:

id  user    time    io
2   9   1370931664  out
3   6   1370932128  out
5   12  1370933037  in

I'm pretty close so far, but I realized that views won't accept subquerys, which is making it a lot harder. The closest query I got was :

select 
    `lms_attendance`.`id` AS `id`,
    `lms_attendance`.`user` AS `user`,
    max(`lms_attendance`.`time`) AS `time`,
    `lms_attendance`.`io` AS `io` 
from `lms_attendance` 
group by 
    `lms_attendance`.`user`, 
    `lms_attendance`.`io`

But what I get is :

id  user    time    io
3   6   1370932128  out
1   9   1370931664  out
5   12  1370933037  in
4   12  1370932128  out

Which is close, but not perfect. I know that last group by shouldn't be there, but without it, it returns the most recent time, but not with it's relative IO value.

Any ideas? Thanks!

possible duplicate of [How to select the most recent set of dated records from a mysql table](http://stackoverflow.com/questions/435703/how-to-select-the-most-recent-set-of-dated-records-from-a-mysql-table) — Barmar, Jun 11 '13 at 07:01
Go back to the manual. You'll see that it offers solutions to this problem both with and without (correlated and uncorelated) subqueries. — Strawberry, Jun 11 '13 at 07:13
@Barmar, technically, as I pointed out in my answer, this is a duplicate of all 700 questions with the [tag:greatest-n-per-group] tag. — Tomas, Jun 11 '13 at 07:51
I had a column called "IO" which stands for "in or out", it was an enum type with possible values "in" or "out". This was used to keep track of when people checked in and out of a class. — Keith, May 10 '17 at 02:04
This works for me, but takes too long; if I do "select * from data" it takes 0.15 sec. If I follow the algorithm above it takes 23 seconds! — beep_check, May 19 '17 at 20:44

score 263 · Accepted Answer · edited Dec 17 '21 at 20:13

263

Query:

SQLFIDDLEExample

SELECT t1.*
FROM lms_attendance t1
WHERE t1.time = (SELECT MAX(t2.time)
                 FROM lms_attendance t2
                 WHERE t2.user = t1.user)

Result:

| ID | USER |       TIME |  IO |
--------------------------------
|  2 |    9 | 1370931664 | out |
|  3 |    6 | 1370932128 | out |
|  5 |   12 | 1370933037 |  in |

Note that if a user has multiple records with the same "maximum" time, the query above will return more than one record. If you only want 1 record per user, use the query below:

SQLFIDDLEExample

SELECT t1.*
FROM lms_attendance t1
WHERE t1.id = (SELECT t2.id
                 FROM lms_attendance t2
                 WHERE t2.user = t1.user            
                 ORDER BY t2.id DESC
                 LIMIT 1)

edited Dec 17 '21 at 20:13

Stevoisiak

23,794
27
122
225

answered Jun 11 '13 at 07:22

Justin

9,634
6
35
47

2

wow! not only did this work, i was allowed to create a view with this query even though it contains subqueries. before, when i tried to create a view containing subqueries, it didn't let me. are there rules as to why this is allowed but another one isn't? – Keith Jun 11 '13 at 07:24
very weird. thanks a ton! maybe it was because my subquery was a pseudo table that i was selecting FROM, where in this example its used in the WHERE clause. – Keith Jun 11 '13 at 07:27
5

No need for subqueries! Moreover, this solution [doesn't work if there are two records with exactly the same time](http://sqlfiddle.com/#!2/dbb2d/2). There is no need to try reinvent the wheel every time, as this is common problem - instead, go for already tested and optimized solutions - @Prodikl see my answer. – Tomas Jun 11 '13 at 07:42
ah, thanks for the insight! i'll try the new code when i'm in the office tomorrow. – Keith Jun 11 '13 at 11:46
Beware to those using temp tables in MySQL; I got `ERROR 1137: Can't reopen table: 'temp_table'`, since the query refers to the same temporary table multiple times under different aliases. (Also want to add my 2 cents...queries like this are hard to debug, since the inner `SELECT` can't be run independently). – tumultous_rooster Nov 16 '15 at 07:38
5

@TMS This solution does work if the records have the exact same time, since the query is locating the record with the greatest id. This implies that the time in the table is the insertion time, which may not be a good assumption. Your solution instead compares timestamps and, when two timestamps are identical, you return the row with the greatest id as well. Hence, your solution also assumes that the timestamp in this table is related to the order of insertion, which is the largest flaw with both of your queries. – WebWanderer Jun 08 '16 at 19:42
@MattO'Brien I agree that a subquery is unfavorable since it is impossible to be run individually, yet very simple suqueries like the one suggested are fine in my opinion. I personally prefer the subquery approach for the fact that it is easier to understand than the hash-anti-join approach suggested by TMS. – WebWanderer Jun 08 '16 at 19:46
Wait a minute: PostgreSQL 9.5 ERROR: "`aggregate functions are not allowed in WHERE`." Your first suggestion is not well supported, but your "Solution which gonna work everytime" as you say, is the correct answer. – WebWanderer Jun 21 '16 at 16:30
@WebWanderer nope, my solution *does not* assume anything about how timestamps are related to ids. I need just *any* criteria that will assure that for two records with the same time only one is chosen - and id is good for that since no records have the same id – Tomas Nov 15 '16 at 11:32
Can anyone make this same query work for a DELETE instead? I am using this, but it fails when trying to DELETE. – REAL O G Mar 28 '18 at 18:01
I'm facing the same problem and the person before me used the same solution as op. However, this solution is not scaleable when you have millions of rows. This is because so many subqueries would need to be run. – Jason Cheng Mar 26 '20 at 03:40
I don't understand how this can work. Is this some kind of self join? – Kosta S. Aug 06 '20 at 13:37
Why doesn't this return false positives if one user has records that match the most recent time for another user? Eg: If user 1 has a record at 4:00, and user 2 has records at 4:00 and 5:00, why wouldn't both records from user 2 match? – Stevoisiak Dec 17 '21 at 19:31

score 95 · Answer 2 · edited May 23 '17 at 12:18

95

No need to trying reinvent the wheel, as this is common greatest-n-per-group problem. Very nice solution is presented.

I prefer the most simplistic solution (see SQLFiddle, updated Justin's) without subqueries (thus easy to use in views):

SELECT t1.*
FROM lms_attendance AS t1
LEFT OUTER JOIN lms_attendance AS t2
  ON t1.user = t2.user 
        AND (t1.time < t2.time 
         OR (t1.time = t2.time AND t1.Id < t2.Id))
WHERE t2.user IS NULL

This also works in a case where there are two different records with the same greatest value within the same group - thanks to the trick with (t1.time = t2.time AND t1.Id < t2.Id). All I am doing here is to assure that in case when two records of the same user have same time only one is chosen. Doesn't actually matter if the criteria is Id or something else - basically any criteria that is guaranteed to be unique would make the job here.

edited May 23 '17 at 12:18

Community

1
1

answered Jun 11 '13 at 07:27

Tomas

57,621
49
238
373

1

The max uses `t1.time < t2.time` and the min would be `t1.time > t2.time` which is the opposite of my initial intuition. – None May 12 '15 at 00:44
1

@J.Money because there is implicit negation hidden: you select all records from t1 which *don't have* corresponding record from t2 where the `t1.time < t2.time` condition applies :-) – Tomas May 12 '15 at 14:37
4

`WHERE t2.user IS NULL` is a bit strange. What role does this line play? – tumultous_rooster Nov 16 '15 at 07:41
2

The accepted answer, posted by Justin, may be more optimal. The accepted answer uses a backward index scan on the primary key of the table, followed by a limit, followed by a sequence scan of the table. Therefore, the accepted answer can be greatly optimized with an additional index. This query could be optimized by an index as well, as it performs two sequence scans, yet also includes a hash and a "hash-anti-join" of the results of the sequence scan and the hash of the other sequence scan. I would be interested in an explanation of which approach is truly more optimal. – WebWanderer Jun 08 '16 at 19:52
@TMS could you please clarify `OR (t1.time = t2.time AND t1.Id < t2.Id))` section? – Oleg Kuts Nov 14 '16 at 19:36
@OlegKuts this is to handle the case when two records have the same time. Please see the updated answer – Tomas Nov 15 '16 at 11:36
@TMS thanks for reply! In my case it seems to work without OR section. I have 993 items with "Year" int column that has range from 2006 to 2016, so there should be allot of duplicates with only `AND (t1.Year < t2.Year)` section, but there is none. My query looks like this `SELECT t1.* FROM MyTable AS t1 LEFT JOIN MyTable AS t2 ON t1.Id = t2.Id AND t1.Series = t2.Series AND (t1.Year < t2.Year) WHERE t2.Id IS NULL AND t2.Series IS NULL` (I have a composite PK (Id,Series)) – Oleg Kuts Nov 15 '16 at 11:59
@OlegKuts the exact condition of the duplicate is "two different records with the same greatest value within the same group". Under this circumstance, you will *for sure* have both records in the output. – Tomas Nov 15 '16 at 12:24
@TMS I've find this to have horrible performance issues on bigger tables. Another query that I'm currently using is 'SELECT t1.* FROM table1 AS t1 WHERE date = (SELECT max(t2.date) FROM table2 as t2 WHERE t1.id = t2.id)' . For tables under 2k rows it runs the same, but for my table with 63k rows your variant runs more than 4 sec, and this one less that 1 sec. Could you provide some clarifications please? – Oleg Kuts Dec 02 '16 at 14:11
Oleg Kuts, you could add some WHERE conditions to limit the rows to scan, like limiting to "current year rows". This way big tables become more manageable, the impact on performance is noticeable with the proper index – Juanga Covas Dec 19 '19 at 03:47

user1792210 · Answer 3 · 2016-03-16T18:35:49.390

6

Based in @TMS answer, I like it because there's no need for subqueries but I think ommiting the 'OR' part will be sufficient and much simpler to understand and read.

SELECT t1.*
FROM lms_attendance AS t1
LEFT JOIN lms_attendance AS t2
  ON t1.user = t2.user 
        AND t1.time < t2.time
WHERE t2.user IS NULL

if you are not interested in rows with null times you can filter them in the WHERE clause:

SELECT t1.*
FROM lms_attendance AS t1
LEFT JOIN lms_attendance AS t2
  ON t1.user = t2.user 
        AND t1.time < t2.time
WHERE t2.user IS NULL and t1.time IS NOT NULL

edited Mar 16 '16 at 18:35

answered Mar 11 '16 at 18:07

user1792210

155
3
8

1

Omitting the `OR` part is a really bad idea if two records can have same `time`. – Tomas Nov 15 '16 at 11:39
I would avoid this solution for performance sake. As @OlegKuts mentioned, this gets very slow on mid-to-large data sets. – Peter Meadley Mar 13 '18 at 10:02

score 6 · Answer 4 · answered Mar 02 '20 at 13:35

6

If your on MySQL 8.0 or higher you can use Window functions:

Query:

DBFiddleExample

SELECT DISTINCT
FIRST_VALUE(ID) OVER (PARTITION BY lms_attendance.USER ORDER BY lms_attendance.TIME DESC) AS ID,
FIRST_VALUE(USER) OVER (PARTITION BY lms_attendance.USER ORDER BY lms_attendance.TIME DESC) AS USER,
FIRST_VALUE(TIME) OVER (PARTITION BY lms_attendance.USER ORDER BY lms_attendance.TIME DESC) AS TIME,
FIRST_VALUE(IO) OVER (PARTITION BY lms_attendance.USER ORDER BY lms_attendance.TIME DESC) AS IO
FROM lms_attendance;

Result:

| ID | USER |       TIME |  IO |
--------------------------------
|  2 |    9 | 1370931664 | out |
|  3 |    6 | 1370932128 | out |
|  5 |   12 | 1370933037 |  in |

The advantage I see over using the solution proposed by Justin is that it enables you to select the row with the most recent data per user (or per id, or per whatever) even from subqueries without the need for an intermediate view or table.

And in case your running a HANA it is also ~7 times faster :D

answered Mar 02 '20 at 13:35

whme

4,908
5
15
28

Would you need to add `FIRST_VALUE()` to every field you want to pull? – Stevoisiak Dec 02 '21 at 20:32
Since OP asked about getting the value with the most recent date per user, this requires to order by date and take the first value. If your not reducing the result set per window function down to 1 row somehow, there is no point using it I guess – whme Dec 03 '21 at 06:23
I more so meant, is there a way to avoid the repeated `FIRST_VALUE()` and `PARTITION BY ORDER BY DESC` on every value you want to pull? – Stevoisiak Dec 03 '21 at 18:19
I think so.. but I am not sure. Maybe this would make a good SO question? – whme Dec 06 '21 at 09:32

davmos · Answer 5 · 2017-03-23T09:34:06.643

Already solved, but just for the record, another approach would be to create two views...

CREATE TABLE lms_attendance
(id int, user int, time int, io varchar(3));

CREATE VIEW latest_all AS
SELECT la.user, max(la.time) time
FROM lms_attendance la 
GROUP BY la.user;

CREATE VIEW latest_io AS
SELECT la.* 
FROM lms_attendance la
JOIN latest_all lall 
    ON lall.user = la.user
    AND lall.time = la.time;

INSERT INTO lms_attendance 
VALUES
(1, 9, 1370931202, 'out'),
(2, 9, 1370931664, 'out'),
(3, 6, 1370932128, 'out'),
(4, 12, 1370932128, 'out'),
(5, 12, 1370933037, 'in');

SELECT * FROM latest_io;

Click here to see it in action at SQL Fiddle

thanks for the follow up! yeah, i was going to create multiple views if there wasn't an easier way. thanks again — Keith, Jun 11 '13 at 07:32

score 1 · Answer 6 · answered Jun 26 '19 at 15:09

1

Ok, this might be either a hack or error-prone, but somehow this is working as well-

SELECT id, MAX(user) as user, MAX(time) as time, MAX(io) as io FROM lms_attendance GROUP BY id;

answered Jun 26 '19 at 15:09

kev

2,741
5
22
48

score 1 · Answer 7 · answered Dec 22 '21 at 11:27

1

I have tried one solution which works for me

    SELECT user, MAX(TIME) as time
      FROM lms_attendance
      GROUP by user
      HAVING MAX(time)

answered Dec 22 '21 at 11:27

Mukeem Ahmad

11
1

score 1 · Answer 8 · answered Aug 16 '22 at 13:49

I have a very large table and all of the other suggestions here were taking a very long time to execute. I came up with this hacky method that was much faster. The downside is, if the max(date) row has a duplicate date for that user, it will return both of them.

SELECT * FROM mb_web.devices_log WHERE CONCAT(dtime, '-', user_id) in (
    SELECT concat(max(dtime), '-', user_id) FROM mb_web.devices_log GROUP BY user_id
)

score 0 · Answer 9 · answered Jun 11 '13 at 07:18

0

select b.* from 

    (select 
        `lms_attendance`.`user` AS `user`,
        max(`lms_attendance`.`time`) AS `time`
    from `lms_attendance` 
    group by 
        `lms_attendance`.`user`) a

join

    (select * 
    from `lms_attendance` ) b

on a.user = b.user
and a.time = b.time

answered Jun 11 '13 at 07:18

chetan

2,876
1
14
15

thanks. i know i can do it using a subquery, but i was hoping to turn this into a view, and it won't allow subqueries in views AFAIK. would i have to turn each sub query into a view, etc.? – Keith Jun 11 '13 at 07:20
`join (select * from lms_attendance ) b` = `join lms_attendance b` – azerafati Nov 12 '16 at 15:14

score -1 · Answer 10 · answered Nov 04 '18 at 18:24

-1

 select result from (
     select vorsteuerid as result, count(*) as anzahl from kreditorenrechnung where kundeid = 7148
     group by vorsteuerid
 ) a order by anzahl desc limit 0,1

answered Nov 04 '18 at 18:24

Konstantin XFlash Stratigenas

605
6
10

score -1 · Answer 11 · answered Dec 04 '20 at 10:18

-1

I have done same thing like below

SELECT t1.* FROM lms_attendance t1 WHERE t1.id in (SELECT max(t2.id) as id FROM lms_attendance t2 group BY t2.user)

This will also reduce memory utilization.

Thanks.

answered Dec 04 '20 at 10:18

Rohit Ramani

776
7
15

score -3 · Answer 12 · answered Jun 11 '13 at 07:17

-3

Possibly you can do group by user and then order by time desc. Something like as below

  SELECT * FROM lms_attendance group by user order by time desc;

answered Jun 11 '13 at 07:17

user2365199

29
1
8

score -3 · Answer 13 · edited Sep 22 '17 at 05:38

-3

Try this query:

  select id,user, max(time), io 
  FROM lms_attendance group by user;

edited Sep 22 '17 at 05:38

dance2die

35,807
39
131
194

answered Jun 30 '17 at 07:57

Sugan

335
3
10

Try making a SQLFiddle of this. You will likely find that `id` and `io` are nonaggregated columns, which cannot be used in a `group by`. – Dewi Morgan Aug 13 '18 at 15:52
1

there is no guarantee id will be the id with max(time), it could be any of the ids within the group. this is the problem I came here to resolve, still looking – robisrob Oct 05 '18 at 15:48

score -3 · Answer 14 · answered Dec 13 '17 at 21:59

-3

This worked for me:

SELECT user, time FROM 
(
    SELECT user, time FROM lms_attendance --where clause
) AS T 
WHERE (SELECT COUNT(0) FROM table WHERE user = T.user AND time > T.time) = 0
ORDER BY user ASC, time DESC

answered Dec 13 '17 at 21:59

Alvaro Sifuentes

33
1

Select row with most recent date per user

14 Answers14

Linked

Related