0

I've tried a lot of the answers given on stackoverflow, but they're all very poor performance. They all suggest either using subqueries or joins with the same table where the join is on p1.pid = p2.pid AND p1.entryTime < p2.entryTime. Both are very slow options.

Is there a performant way to query this?

DATA

PID    STATUS    ENTRYTIME
---    ------    ---------
001    start     2016-05-26 01:18:00
002    start     2016-05-26 01:19:00
001    done      2016-05-26 02:11:00
003    start     2016-05-26 11:12:00
002    done      2016-05-26 11:19:00
003    inprog    2016-05-26 12:15:00

WHAT SHOULD BE RETURNED

PID    STATUS    ENTRYTIME
---    ------    ---------
001    done      2016-05-26 02:11:00
002    done      2016-05-26 11:19:00
003    inprog    2016-05-26 12:15:00

I cannot figure out how to do this in one query that isn't extremely slow.

I've tried queries adapted form these answers:

Retrieving the last record in each group
Get records with max value for each group of grouped SQL results
https://stackoverflow.com/a/17038667/857025

Community
  • 1
  • 1
Don Rhummy
  • 24,730
  • 42
  • 175
  • 330

4 Answers4

1

Have you tried making the aggregate first and join it with the table

select a.pid, a.lastentrytime, b.status from
(select pid, max(entrytime) lastentrytime from table_name group by pid) a 
inner join table_name b on a.pid = b.pid and a.lastentrytime = b.entrytime
Turo
  • 4,724
  • 2
  • 14
  • 27
  • Best answer! Played around with about a million records. The above join approach runs for about 5 sec, my analytic function approach goes for 15 sec, and the sub-query approach drops the connection after 10 min... – Koshera May 31 '16 at 15:33
  • @Turo Am I wrong that this query wouldn't be even simpler and give the proper results? `SELECT pid, MAX(entryTime), status FROM table_name GROUP BY pid;` Why have the join? – Don Rhummy Jun 01 '16 at 15:46
  • Mysql is not my usual db, and my query works on everey rdbms. I read about this feature, but im not sure wether the right status will be listed, see http://stackoverflow.com/questions/1023347/mysql-selecting-a-column-not-in-group-by – Turo Jun 01 '16 at 17:14
  • @Don Rhummy, if you just run the query that you suggest, status will not be part of the group by expression so you will have to do an aggregate function on it and this will mess up the results... That's why Turo is doing the group by in the inner query without status and then joins to get the status value outside... – Koshera Jun 01 '16 at 18:21
0

One method uses a correlated subquery:

select t.*
from t
where t.entrytime = (select max(t2.entrytime) from t t2 where t2.pid = t.pid);

For performance, you want an index on t(pid, entrytime).

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
0

Sometimes you have to denormalize. Disk space is cheap, and an extra insert is cheap if you are querying the data multiple times. I would add a trigger to the first table that keeps a second table with the final status updated. Actually I would have done the reverse, I would have the current_status table be the one I insert or update on, and have the table you show be my history table populated by a trigger.

Tony BenBrahim
  • 7,040
  • 2
  • 36
  • 49
0

The amount of SQL that you have to write in MySQL in order to simulate an Analytic Function is ridiculous... This should do the trick to get the last values of entrytime and status by a partition of pid. This is something that can be done with a single line in most RDBMS... If you want to get the first values - change the order by of entrytime to asc. The query repeats them for every row though - that's why I have a distinct on top. Please test for performance and let me know how it behaves. It should be faster than sub-queries but you never know...

SELECT  distinct  
pid,
@s as `LAST_VALUE (status) OVER (PARTITION BY pid ORDER BY pid)`,
@r AS `LAST_VALUE(entrytime) OVER (PARTITION BY pid ORDER BY pid)`
FROM    (
        SELECT  m.*
        FROM    (
                SELECT  @_pid = NULL
                ) vars,
            test m
        ORDER BY
        pid, entrytime desc
        ) mo   WHERE   (CASE 
        WHEN @_pid IS NULL OR @_pid <> pid 
            THEN @r := entrytime 
            ELSE entrytime 
        END IS NOT NULL 
        AND
        CASE 
        WHEN @_pid IS NULL OR @_pid <> pid 
            THEN @s := status 
            ELSE entrytime 
        END IS NOT NULL)  AND (@_pid := pid) IS NOT NULL

Sorry about the formatting - the code formatter in StackOverflow messed it up a bit...

Koshera
  • 439
  • 5
  • 14