Get top 1 row of each group

Question

I have a table which I want to get the latest entry for each group. Here's the table:

DocumentStatusLogs Table

|ID| DocumentID | Status | DateCreated |
| 2| 1          | S1     | 7/29/2011   |
| 3| 1          | S2     | 7/30/2011   |
| 6| 1          | S1     | 8/02/2011   |
| 1| 2          | S1     | 7/28/2011   |
| 4| 2          | S2     | 7/30/2011   |
| 5| 2          | S3     | 8/01/2011   |
| 6| 3          | S1     | 8/02/2011   |

The table will be grouped by DocumentID and sorted by DateCreated in descending order. For each DocumentID, I want to get the latest status.

My preferred output:

| DocumentID | Status | DateCreated |
| 1          | S1     | 8/02/2011   |
| 2          | S3     | 8/01/2011   |
| 3          | S1     | 8/02/2011   |

Is there any aggregate function to get only the top from each group? See pseudo-code GetOnlyTheTop below:

SELECT
  DocumentID,
  GetOnlyTheTop(Status),
  GetOnlyTheTop(DateCreated)
FROM DocumentStatusLogs
GROUP BY DocumentID
ORDER BY DateCreated DESC

If such function doesn't exist, is there any way I can achieve the output I want?
Or at the first place, could this be caused by unnormalized database? I'm thinking, since what I'm looking for is just one row, should that status also be located in the parent table?

Please see the parent table for more information:

Current Documents Table

| DocumentID | Title  | Content  | DateCreated |
| 1          | TitleA | ...      | ...         |
| 2          | TitleB | ...      | ...         |
| 3          | TitleC | ...      | ...         |

Should the parent table be like this so that I can easily access its status?

| DocumentID | Title  | Content  | DateCreated | CurrentStatus |
| 1          | TitleA | ...      | ...         | s1            |
| 2          | TitleB | ...      | ...         | s3            |
| 3          | TitleC | ...      | ...         | s1            |

UPDATE I just learned how to use "apply" which makes it easier to address such problems.

For a more detailed discussion and comparison of possible solutions I recommend to read the similar question on dba.se: [Retrieving n rows per group](http://dba.stackexchange.com/q/86415/57105). — Vladimir Baranov, Nov 06 '16 at 10:33
I looked at the post and tried it. Using *group by StoreID* generated an error. — UltraJ, Sep 06 '18 at 21:32
Related: [Select first row in each GROUP BY group?](https://stackoverflow.com/questions/3800551/select-first-row-in-each-group-by-group) — Vadzim, Jun 14 '20 at 10:35
This question is for MS SQL Server, question [Select first row in each GROUP BY group](https://stackoverflow.com/questions/3800551/select-first-row-in-each-group-by-group) is for Postgres. The language constructs can be different. — Michael Freidgeim, May 17 '22 at 13:26

gbn · Accepted Answer · 2011-07-27T09:52:26.897

947

;WITH cte AS
(
   SELECT *,
         ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
   FROM DocumentStatusLogs
)
SELECT *
FROM cte
WHERE rn = 1

If you expect 2 entries per day, then this will arbitrarily pick one. To get both entries for a day, use DENSE_RANK instead

As for normalised or not, it depends if you want to:

maintain status in 2 places
preserve status history
...

As it stands, you preserve status history. If you want latest status in the parent table too (which is denormalisation) you'd need a trigger to maintain "status" in the parent. or drop this status history table.

edited Jul 27 '11 at 09:52

answered Jul 27 '11 at 08:44

gbn

422,506
82
585
676

5

And... What is `Partition By`? `With` is new to me also :( I'm using mssql 2005 anyway. – kazinix Jul 27 '11 at 08:48
9

@domanokz: Partition By resets the count. So in this case, it says to count per DocumentID – gbn Jul 27 '11 at 08:50
2

Hm, I worry about the performance, I'll be querying millions of rows. Is SELECT * FROM (SELECT ...) affects the performance? Also, is `ROW_NUMBER` some kind of a subquery for each row? – kazinix Jul 27 '11 at 09:21
1

@domanokz: no, it's not a subquery. If you have correct indexes then millions shouldn't be a problem. There are only 2 set based ways anyway: this and the aggregate (Ariel's solution). So try them both... – gbn Jul 27 '11 at 09:30
would you mind to look at my question again? I've edited it, I added the ID to DocumentStatusLogs table. I think it might help us to optimize the query. Thanks! – kazinix Jul 27 '11 at 09:50
1

@domanokz: Just change ORDER BY DateCreated DESC to ORDER BY ID DESC – gbn Jul 27 '11 at 09:52
1

This is great, I'm used to subquery for this task. I find this solutions much more appealing. – dbd Nov 15 '13 at 08:47
1

this so simple and effective. Much more efficient than some subqueries. Thank you ! – theSpyCry Feb 04 '15 at 14:11
What is the starting semicolon for? – Caltor Feb 06 '19 at 10:37
1

@Caltor I left the ; when having above in a larger SQL statement and got the following error: 'Incorrect syntax near the keyword 'with'. If this statement is a common table expression, an xmlnamespaces clause or a change tracking context clause, the previous statement must be terminated with a semicolon.' A GO just before - or no other statement - works as well. – Andreas Reiff May 21 '19 at 18:19
Used in my context, the CTE solution is much quicker than the CROSS APPLY: a couple of seconds in the first case (less than 10) vs 56 mins in the second one. That's appreciable! – user3341592 Nov 05 '20 at 20:12
This doesn't seem to work in sql server?! I get the error 'invalid column name rn' – niico May 30 '22 at 20:37

kazinix · Answer 2 · 2013-07-01T05:28:45.730

247

I just learned how to use cross apply. Here's how to use it in this scenario:

 select d.DocumentID, ds.Status, ds.DateCreated 
 from Documents as d 
 cross apply 
     (select top 1 Status, DateCreated
      from DocumentStatusLogs 
      where DocumentID = d.DocumentId
      order by DateCreated desc) as ds

edited Jul 01 '13 at 05:28

answered Aug 30 '12 at 06:10

kazinix

28,987
33
107
157

Doesn't this give you the date the document was created, not the date the status was created? – BitwiseMan Sep 04 '12 at 19:07
2

That actually makes no difference since the issue is still addressed. – kazinix Sep 05 '12 at 05:57
37

I just posted the results of my timing tests against all of the proposed solutions and yours came out on top. Giving you an up vote :-) – John Fairbanks Mar 07 '15 at 15:00
5

+1 for huge speed improvement. This is much faster than a windowing function such as ROW_NUMBER(). It would be nice if SQL recognized ROW_NUMBER() = 1 like queries and optimized them into Applies. Note: I used OUTER APPLY as I needed results, even if they didn't exist in the apply. – TamusJRoyce Oct 19 '15 at 14:17
I didn't get improvement but idea is interesting – SalientBrain Jan 11 '16 at 09:34
9

@TamusJRoyce you can't extrapolate that just because it was faster once this is always the case. It depends. As described here http://sqlmag.com/database-development/optimizing-top-n-group-queries – Martin Smith Jun 03 '16 at 21:26
2

My comment is about having multiple rows, and only desiring one of those multiple rows per group. Joins are for when you want one to many. Applies are for when you have one to many, but want to filter out all except a one to one. Scenario: For 100 members, give me each their best phone number (where each could have several numbers). This is where Apply excels. Less reads = less disk access = better performance. Given my experience is with poorly designed non-normalized databases. – TamusJRoyce Jun 05 '16 at 17:36
@MartinSmith From your article, "Some solutions work well only when the right indexes are available, but without those indexes the solutions perform badly." - Great point! The above scenario is when you are able to view execution plan and add indexes where needed. If you are not able to add indexes, you will need to do a case-by-case test (which you should probably do anyways). – TamusJRoyce Jun 05 '16 at 17:47
This has to be black magic. This helped me to find the most recent entry and the first entry for each resource returned. Very useful. – Trevor Nestman Jul 18 '17 at 17:48
@dpp your answer is not giving me one row per group. It is returning the whole group? Am I missing something. – Helen Araya Oct 30 '19 at 04:22
6

This works well when you already have a separate `Documents` table that gives one row per group, as desired in the output. But if you're only working with the one table (`DocumentStatusLogs` in this case), you'd first have to do some sort of `DISTINCT` operation on `DocumentID` (or `ROW_NUMBER()`, `MAX(ID)`, etc.), losing all that gained performance. – Extragorey Aug 03 '20 at 04:25
1

This works very well. But keep in mind that if you need to preserve the select results even if APPLY returns empty, then you need to use OUTER APPLY rather than CROSS APPLY. – Turab Jun 02 '21 at 09:52
I just heard about cross apply, the practical of it is kind of confusing. some people comparing this with ``inner join``, question is: ``inner join`` states the ``connectors`` or pk to fk. But ``Cross Apply`` doesn't. How do we grouping it with ``cross apply`` does it automatically detecting the same column name? this thread might me outdated but I really want to know. Thank you for the knowledge @dpp – yuliansen Jul 17 '21 at 04:18
@Extragorey, totally agree. I was having a query for most recent results for a device-command pairs from the 'results' table hanging intermittently - for indefinite time. And the main problem was a full scan of the 'results' table to obtain these pairs. So I just made a new table with device and command columns and unique primary key on both, and then applied dpp's solution. It worked like a charm! – Sergey Nudnov Jan 29 '22 at 00:32

Josh Gilfillan · Answer 3 · 2021-05-01T13:02:55.000

229

I know this is an old thread but the TOP 1 WITH TIES solutions is quite nice and might be helpful to some reading through the solutions.

select top 1 with ties
   DocumentID
  ,Status
  ,DateCreated
from DocumentStatusLogs
order by row_number() over (partition by DocumentID order by DateCreated desc)

The select top 1 with ties clause tells SQL Server that you want to return the first row per group. But how does SQL Server know how to group up the data? This is where the order by row_number() over (partition by DocumentID order by DateCreated desc comes in. The column/columns after partition by defines how SQL Server groups up the data. Within each group, the rows will be sorted based on the order by columns. Once sorted, the top row in each group will be returned in the query.

More about the TOP clause can be found here.

edited May 01 '21 at 13:02

answered Jan 24 '18 at 00:14

Josh Gilfillan

4,348
2
24
26

3

agreed - this best replicates what is very easy to do in other versions of SQL and other languages imo – Chris Umphlett Nov 12 '19 at 21:38
I agree that this is an elegant solution. In my particular query and on SQL Server 2019 this was twice as slow as the cross apply with top 1 solution, but measure for yourself. – N8allan Aug 19 '20 at 23:02
4

Hmm, 'With Ties' might cause more rows to be returned than the value specified in expression (TOP 1). If the OP wants only 1, then you need to remove this phrase, right? – TK Bruin Oct 27 '20 at 17:26
3

@TKBruin that is why the order by row_number() is required. This allows the top record per partition to be retrieved. – Josh Gilfillan Oct 28 '20 at 11:48
1

This deserves to be the best answer... its speed is absolutely compareable to using CTE with window function, but its so much more maintainable... I have hundrets of satellite tables in my DataVault models and with this solution I don't have to retype the attribute projection again and again for each table to get a view on its most recent entry. Also this solution is often faster than joining to the PIT Table to get the latest entries. Truly a gamechanger for me – Jürgen Zornig May 17 '22 at 10:26
Does CROSS APPLY have any use here instead? – Edge Oct 09 '22 at 20:01

score 61 · Answer 4 · answered Mar 07 '15 at 14:57

I've done some timings over the various recommendations here, and the results really depend on the size of the table involved, but the most consistent solution is using the CROSS APPLY These tests were run against SQL Server 2008-R2, using a table with 6,500 records, and another (identical schema) with 137 million records. The columns being queried are part of the primary key on the table, and the table width is very small (about 30 bytes). The times are reported by SQL Server from the actual execution plan.

Query                                  Time for 6500 (ms)    Time for 137M(ms)

CROSS APPLY                                    17.9                17.9
SELECT WHERE col = (SELECT MAX(COL)…)           6.6               854.4
DENSE_RANK() OVER PARTITION                     6.6               907.1

I think the really amazing thing was how consistent the time was for the CROSS APPLY regardless of the number of rows involved.

It all depends on the data distribution and available indexes. It was discussed at great lengths on [dba.se](http://dba.stackexchange.com/questions/86415/retrieving-n-rows-per-group). — Vladimir Baranov, Nov 06 '16 at 10:27

score 42 · Answer 5 · edited Feb 23 '23 at 12:05

42

If you're worried about performance, you can also do this with MAX():

SELECT *
FROM DocumentStatusLogs D
WHERE DateCreated = (SELECT MAX(DateCreated) FROM DocumentStatusLogs WHERE ID = D.ID)

ROW_NUMBER() requires a sort of all the rows in your SELECT statement, whereas MAX does not. Should drastically speed up your query.

edited Feb 23 '23 at 12:05

smonff

3,399
3
36
46

answered Jan 15 '13 at 20:57

Daniel Cotter

1,342
2
15
27

2

Cannot performance issues with ROW_NUMBER() be addressed with proper indexing? (I feel that should be done anyhow) – Kristoffer L Oct 22 '13 at 07:17
12

With datetime, you cannot guarantee two entries won't be added on the same date and time. Precision isn't high enough. – TamusJRoyce Oct 19 '15 at 14:20
1

+1 for simplicity. @TamusJRoyce is right. What about? 'select * from DocumentStatusLog D where ID = (select ID from DocumentsStatusLog where D.DocumentID = DocumentID order by DateCreated DESC limit 1);' – cibercitizen1 Jun 10 '17 at 16:22
SELECT * FROM EventScheduleTbl D WHERE DatesPicked = (SELECT top 1 min(DatesPicked) FROM EventScheduleTbl WHERE EventIDf = D.EventIDf and DatesPicked>= convert(date,getdate()) ) – Arun Prasad E S Feb 01 '18 at 07:58
There are definitely cases where this will outperform `row_number()` even with proper indexing. I find it especially valuable in self-join scenarios. The thing to be cognizant of though, is that this method will often yield a higher number of both logical reads and scan counts, despite reporting a low subtree cost. You'll need to weigh the cost/benefits in your particular case to determine if it's actually better. – pim Aug 02 '18 at 16:03
1

In my case, this approach was SLOWER than using ROW_NUMBER(), due to the introduction of a subquery. You should test different approaches to see what performs best for your data. – Scott Oct 07 '19 at 14:53
I up-voted this answer for its compliance with the SQL standard: this approach also works in databases that are not SQL Server. – Lonnie Best Sep 17 '21 at 06:35

Ariel · Answer 6 · 2011-07-27T08:49:46.493

31

SELECT * FROM
DocumentStatusLogs JOIN (
  SELECT DocumentID, MAX(DateCreated) DateCreated
  FROM DocumentStatusLogs
  GROUP BY DocumentID
  ) max_date USING (DocumentID, DateCreated)

What database server? This code doesn't work on all of them.

Regarding the second half of your question, it seems reasonable to me to include the status as a column. You can leave DocumentStatusLogs as a log, but still store the latest info in the main table.

BTW, if you already have the DateCreated column in the Documents table you can just join DocumentStatusLogs using that (as long as DateCreated is unique in DocumentStatusLogs).

Edit: MsSQL does not support USING, so change it to:

ON DocumentStatusLogs.DocumentID = max_date.DocumentID AND DocumentStatusLogs.DateCreated = max_date.DateCreated

edited Jul 27 '11 at 08:49

answered Jul 27 '11 at 08:44

Ariel

25,995
5
59
69

5

The clue was in the title: MSSQL. SQL Server does not have USING but the idea is OK. – gbn Jul 27 '11 at 08:50
8

@gbn The stupid moderators usually delete important keywords from titles, as they have done here. Making it very difficult to find the correct answers in search results or Google. – NickG Sep 24 '15 at 08:41
2

Jus to point out that this "solution" can still give you multiple records if you have a tie on the `max(DateCreated)` – MoonKnight Nov 16 '17 at 21:40
Remove the Using (in MS SQL ) and complete the Join code , then it would work . – faheem khan Dec 26 '20 at 06:50

score 30 · Answer 7 · edited May 29 '23 at 16:46

30

This is one of the most easily found question on the topic, so I wanted to give a modern answer to the it (both for my reference and to help others out). By using first_value and over you can make short work of the above query:

Select distinct DocumentID
  , first_value(status) over (partition by DocumentID order by DateCreated Desc) as Status
  , first_value(DateCreated) over (partition by DocumentID order by DateCreated Desc) as DateCreated
From DocumentStatusLogs

This should work in SQL Server 2012 and up. First_value can be thought of as a way to accomplish Select Top 1 when using an over clause. Over allows grouping in the select list so instead of writing nested subqueries (like many of the existing answers do), this does it in a more readable fashion.

edited May 29 '23 at 16:46

Thom A

88,727
11
45
75

answered Jan 18 '18 at 00:55

Randall

1,441
13
19

3

This doesn't work in SQL Server 2008 R2. I think first_value was introduced in 2012! – ufo Jan 25 '18 at 08:33
4

Very fast! I was using the Cross Apply solution offered by @dpp, but this one is waaaay faster. – MattSlay Aug 31 '19 at 18:03
For large numbers of columns (Status, DateCreated, etc.), does this do a separate partition/sort for each column, or does it get optimised into one? – Extragorey Aug 03 '20 at 04:47

score 14 · Answer 8 · answered Jun 17 '19 at 05:18

Here are 3 separate approaches to the problem in hand along with the best choices of indexing for each of those queries (please try out the indexes yourselves and see the logical read, elapsed time, execution plan. I have provided the suggestions from my experience on such queries without executing for this specific problem).

Approach 1: Using ROW_NUMBER(). If rowstore index is not being able to enhance the performance, you can try out nonclustered/clustered columnstore index as for queries with aggregation and grouping and for tables which are ordered by in different columns all the times, columnstore index usually is the best choice.

;WITH CTE AS
    (
       SELECT   *,
                RN = ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
       FROM     DocumentStatusLogs
    )
    SELECT  ID      
        ,DocumentID 
        ,Status     
        ,DateCreated
    FROM    CTE
    WHERE   RN = 1;

Approach 2: Using FIRST_VALUE. If rowstore index is not being able to enhance the performance, you can try out nonclustered/clustered columnstore index as for queries with aggregation and grouping and for tables which are ordered by in different columns all the times, columnstore index usually is the best choice.

SELECT  DISTINCT
    ID      = FIRST_VALUE(ID) OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
    ,DocumentID
    ,Status     = FIRST_VALUE(Status) OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
    ,DateCreated    = FIRST_VALUE(DateCreated) OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
FROM    DocumentStatusLogs;

Approach 3: Using CROSS APPLY. Creating rowstore index on DocumentStatusLogs table covering the columns used in the query should be enough to cover the query without need of a columnstore index.

SELECT  DISTINCT
    ID      = CA.ID
    ,DocumentID = D.DocumentID
    ,Status     = CA.Status 
    ,DateCreated    = CA.DateCreated
FROM    DocumentStatusLogs D
    CROSS APPLY (
            SELECT  TOP 1 I.*
            FROM    DocumentStatusLogs I
            WHERE   I.DocumentID = D.DocumentID
            ORDER   BY I.DateCreated DESC
            ) CA;

I have a 100M rows table where I nedded to get both the 1st and the last record for each group. The first two approaches took several minutes to execute. Approach 3 took less than a second. — Pedro Ludovico Bozzini, Mar 19 '21 at 19:31

score 11 · Answer 9 · answered Jun 03 '14 at 08:34

This is quite an old thread, but I thought I'd throw my two cents in just the same as the accepted answer didn't work particularly well for me. I tried gbn's solution on a large dataset and found it to be terribly slow (>45 seconds on 5 million plus records in SQL Server 2012). Looking at the execution plan it's obvious that the issue is that it requires a SORT operation which slows things down significantly.

Here's an alternative that I lifted from the entity framework that needs no SORT operation and does a NON-Clustered Index search. This reduces the execution time down to < 2 seconds on the aforementioned record set.

SELECT 
[Limit1].[DocumentID] AS [DocumentID], 
[Limit1].[Status] AS [Status], 
[Limit1].[DateCreated] AS [DateCreated]
FROM   (SELECT DISTINCT [Extent1].[DocumentID] AS [DocumentID] FROM [dbo].[DocumentStatusLogs] AS [Extent1]) AS [Distinct1]
OUTER APPLY  (SELECT TOP (1) [Project2].[ID] AS [ID], [Project2].[DocumentID] AS [DocumentID], [Project2].[Status] AS [Status], [Project2].[DateCreated] AS [DateCreated]
    FROM (SELECT 
        [Extent2].[ID] AS [ID], 
        [Extent2].[DocumentID] AS [DocumentID], 
        [Extent2].[Status] AS [Status], 
        [Extent2].[DateCreated] AS [DateCreated]
        FROM [dbo].[DocumentStatusLogs] AS [Extent2]
        WHERE ([Distinct1].[DocumentID] = [Extent2].[DocumentID])
    )  AS [Project2]
    ORDER BY [Project2].[ID] DESC) AS [Limit1]

Now I'm assuming something that isn't entirely specified in the original question, but if your table design is such that your ID column is an auto-increment ID, and the DateCreated is set to the current date with each insert, then even without running with my query above you could actually get a sizable performance boost to gbn's solution (about half the execution time) just from ordering on ID instead of ordering on DateCreated as this will provide an identical sort order and it's a faster sort.

score 5 · Answer 10 · answered Sep 23 '12 at 11:22

5

My code to select top 1 from each group

select a.* from #DocumentStatusLogs a where 
 datecreated in( select top 1 datecreated from #DocumentStatusLogs b
where 
a.documentid = b.documentid
order by datecreated desc
)

answered Sep 23 '12 at 11:22

AnuPrakash

51
1
4

score 5 · Answer 11 · edited Jul 23 '20 at 09:32

5

This solution can be used to get the TOP N most recent rows for each partition (in the example, N is 1 in the WHERE statement and partition is doc_id):

SELECT T.doc_id, T.status, T.date_created FROM 
(
    SELECT a.*, ROW_NUMBER() OVER (PARTITION BY doc_id ORDER BY date_created DESC) AS rnk FROM doc a
) T
WHERE T.rnk = 1;

edited Jul 23 '20 at 09:32

Mohamed Thaufeeq

1,667
1
13
34

answered Oct 16 '18 at 08:40

praveen

75
1
2

score 4 · Answer 12 · edited Feb 07 '22 at 18:02

Verifying Clint's awesome and correct answer from above:

The performance between the two queries below is interesting. 52% being the top one. And 48% being the second one. A 4% improvement in performance using DISTINCT instead of ORDER BY. But ORDER BY has the advantage to sort by multiple columns.

IF (OBJECT_ID('tempdb..#DocumentStatusLogs') IS NOT NULL) BEGIN DROP TABLE #DocumentStatusLogs END

CREATE TABLE #DocumentStatusLogs (
    [ID] int NOT NULL,
    [DocumentID] int NOT NULL,
    [Status] varchar(20),
    [DateCreated] datetime
)

INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (2, 1, 'S1', '7/29/2011 1:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (3, 1, 'S2', '7/30/2011 2:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (6, 1, 'S1', '8/02/2011 3:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (1, 2, 'S1', '7/28/2011 4:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (4, 2, 'S2', '7/30/2011 5:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (5, 2, 'S3', '8/01/2011 6:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (6, 3, 'S1', '8/02/2011 7:00:00')

Option 1:

    SELECT
    [Extent1].[ID], 
    [Extent1].[DocumentID],
    [Extent1].[Status], 
    [Extent1].[DateCreated]
FROM #DocumentStatusLogs AS [Extent1]
    OUTER APPLY (
        SELECT TOP 1
            [Extent2].[ID], 
            [Extent2].[DocumentID],
            [Extent2].[Status], 
            [Extent2].[DateCreated]
        FROM #DocumentStatusLogs AS [Extent2]
        WHERE [Extent1].[DocumentID] = [Extent2].[DocumentID]
        ORDER BY [Extent2].[DateCreated] DESC, [Extent2].[ID] DESC
    ) AS [Project2]
WHERE ([Project2].[ID] IS NULL OR [Project2].[ID] = [Extent1].[ID])

Option 2:

SELECT 
    [Limit1].[DocumentID] AS [ID], 
    [Limit1].[DocumentID] AS [DocumentID], 
    [Limit1].[Status] AS [Status], 
    [Limit1].[DateCreated] AS [DateCreated]
FROM (
    SELECT DISTINCT [Extent1].[DocumentID] AS [DocumentID] FROM #DocumentStatusLogs AS [Extent1]
) AS [Distinct1]
    OUTER APPLY  (
        SELECT TOP (1) [Project2].[ID] AS [ID], [Project2].[DocumentID] AS [DocumentID], [Project2].[Status] AS [Status], [Project2].[DateCreated] AS [DateCreated]
        FROM (
            SELECT 
                [Extent2].[ID] AS [ID], 
                [Extent2].[DocumentID] AS [DocumentID], 
                [Extent2].[Status] AS [Status], 
                [Extent2].[DateCreated] AS [DateCreated]
            FROM #DocumentStatusLogs AS [Extent2]
            WHERE [Distinct1].[DocumentID] = [Extent2].[DocumentID]
        )  AS [Project2]
        ORDER BY [Project2].[ID] DESC
    ) AS [Limit1]

In Microsoft SQL Server Management Studio: after highlighting and running the first block, highlight both Option 1 and Option 2, right click -> [Display Estimated Execution Plan]. Then run the entire thing to see the results.

Option 1 Results:

ID  DocumentID  Status  DateCreated
6   1   S1  8/2/11 3:00
5   2   S3  8/1/11 6:00
6   3   S1  8/2/11 7:00

Option 2 Results:

ID  DocumentID  Status  DateCreated
6   1   S1  8/2/11 3:00
5   2   S3  8/1/11 6:00
6   3   S1  8/2/11 7:00

Note:

I tend to use APPLY when I want a join to be 1-to-(1 of many).

I use a JOIN if I want the join to be 1-to-many, or many-to-many.

I avoid CTE with ROW_NUMBER() unless I need to do something advanced and am ok with the windowing performance penalty.

I also avoid EXISTS / IN subqueries in the WHERE or ON clause, as I have experienced this causing some terrible execution plans. But mileage varies. Review the execution plan and profile performance where and when needed!

score 4 · Answer 13 · edited May 12 '20 at 23:36

4

CROSS APPLY was the method I used for my solution, as it worked for me, and for my clients needs. And from what I've read, should provide the best overall performance should their database grow substantially.

edited May 12 '20 at 23:36

Dale K

25,246
15
42
71

answered Feb 12 '20 at 16:54

S8Tony

85
1
1
9

score 2 · Answer 14 · answered Dec 19 '16 at 15:10

2

SELECT o.*
FROM `DocumentStatusLogs` o                   
  LEFT JOIN `DocumentStatusLogs` b                   
  ON o.DocumentID = b.DocumentID AND o.DateCreated < b.DateCreated
 WHERE b.DocumentID is NULL ;

If you want to return only recent document order by DateCreated, it will return only top 1 document by DocumentID

answered Dec 19 '16 at 15:10

cho

61
5

1

This is just going to return everything in the table. – Extragorey Aug 03 '20 at 05:19

Union find · Answer 15 · 2021-12-29T23:44:44.727

1

I believe this can be done just like this. This might need some tweaking but you can just select the max from the group.

These answers are overkill..

SELECT
  d.DocumentID,
  MAX(d.Status),
  MAX(d1.DateCreated)
FROM DocumentStatusLogs d, DocumentStatusLogs d1
USING DocumentID
GROUP BY 1
ORDER BY 3 DESC

edited Dec 29 '21 at 23:44

answered Mar 25 '20 at 13:12

Union find

7,759
13
60
111

Is this t-sql? `Using` isn't supported like that... – PedroC88 Apr 22 '21 at 14:49
mysql 8 should support @PedroC88 – Union find Apr 22 '21 at 15:43
1

Yeah I mention it because the OP specified sql-server – PedroC88 Apr 23 '21 at 16:14
@PedroC88 the question seems to have been changed so that it doesn't reference sql-server anymore. so this is an OK answer. – Union find Apr 23 '21 at 16:29
1

It’s on the tags – PedroC88 Apr 24 '21 at 17:17
@PedroC88 If you look at the other comments, it was originally a question specific to that platform but the question changed. Downvoting in this case misses the mark. – Union find Apr 24 '21 at 21:18
Syntax error. And will not return the _row_ having the latest timestamp (for each id.) – jarlh Dec 29 '21 at 12:40

score 0 · Answer 16 · answered Sep 04 '12 at 20:47

In scenarios where you want to avoid using row_count(), you can also use a left join:

select ds.DocumentID, ds.Status, ds.DateCreated 
from DocumentStatusLogs ds
left join DocumentStatusLogs filter 
    ON ds.DocumentID = filter.DocumentID
    -- Match any row that has another row that was created after it.
    AND ds.DateCreated < filter.DateCreated
-- then filter out any rows that matched 
where filter.DocumentID is null

For the example schema, you could also use a "not in subquery", which generally compiles to the same output as the left join:

select ds.DocumentID, ds.Status, ds.DateCreated 
from DocumentStatusLogs ds
WHERE ds.ID NOT IN (
    SELECT filter.ID 
    FROM DocumentStatusLogs filter
    WHERE ds.DocumentID = filter.DocumentID
        AND ds.DateCreated < filter.DateCreated)

Note, the subquery pattern wouldn't work if the table didn't have at least one single-column unique key/constraint/index, in this case the primary key "Id".

Both of these queries tend to be more "expensive" than the row_count() query (as measured by Query Analyzer). However, you might encounter scenarios where they return results faster or enable other optimizations.

score 0 · Answer 17 · edited Sep 19 '19 at 09:47

0

SELECT documentid, 
       status, 
       datecreated 
FROM   documentstatuslogs dlogs 
WHERE  status = (SELECT status 
                 FROM   documentstatuslogs 
                 WHERE  documentid = dlogs.documentid 
                 ORDER  BY datecreated DESC 
                 LIMIT  1)

edited Sep 19 '19 at 09:47

4b0

21,981
30
95
142

answered Sep 19 '19 at 09:45

Koshal Garg

1

MasterKiller · Answer 18 · 2022-01-31T18:57:54.097

Some database engines* are starting to support the QUALIFY clause that allows to filter the result of window functions (which the accepted answer uses).

So the accepted answer can become

SELECT *, ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
FROM DocumentStatusLogs
QUALIFY rn = 1

See this article for an in depth explanation: https://jrandrews.net/the-joy-of-qualify

You can use this tool to see which database support this clause: https://www.jooq.org/translate/ There is an option to transform the qualify clause when the target dialect does not support it.

*Teradata, BigQuery, H2, Snowflake...

score -1 · Answer 19 · edited Jan 07 '20 at 18:12

-1

Try this:

SELECT [DocumentID]
    ,[tmpRez].value('/x[2]', 'varchar(20)') AS [Status]
    ,[tmpRez].value('/x[3]', 'datetime') AS [DateCreated]
FROM (
    SELECT [DocumentID]
        ,cast('<x>' + max(cast([ID] AS VARCHAR(10)) + '</x><x>' + [Status] + '</x><x>' + cast([DateCreated] AS VARCHAR(20))) + '</x>' AS XML) AS [tmpRez]
    FROM DocumentStatusLogs
    GROUP BY DocumentID
    ) AS [tmpQry]

edited Jan 07 '20 at 18:12

Suraj Kumar

5,547
8
20
42

answered Nov 05 '16 at 11:57

gng

1
2

3

You should always describe your SQL statement how it will work and solve the OP's query. – Suraj Kumar Jan 07 '20 at 18:10

Get top 1 row of each group

19 Answers19

Linked

Related