SQL join: selecting the last records in a one-to-many relationship

Question

Suppose I have a table of customers and a table of purchases. Each purchase belongs to one customer. I want to get a list of all customers along with their last purchase in one SELECT statement. What is the best practice? Any advice on building indexes?

Please use these table/column names in your answer:

customer: id, name
purchase: id, customer_id, item_id, date

And in more complicated situations, would it be (performance-wise) beneficial to denormalize the database by putting the last purchase into the customer table?

If the (purchase) id is guaranteed to be sorted by date, can the statements be simplified by using something like LIMIT 1?

Yes, it might be worth denormalizing (if it improves performance a lot, which you can only find out by testing both versions). But the downsides of denormalization are usually worth avoiding. — Vince Bowdren, Jan 21 '10 at 21:46
Related: http://jan.kneschke.de/projects/mysql/groupwise-max/ — igorw, May 25 '11 at 18:26

score 652 · Accepted Answer · edited Aug 25 '19 at 02:23

652

This is an example of the greatest-n-per-group problem that has appeared regularly on StackOverflow.

Here's how I usually recommend solving it:

SELECT c.*, p1.*
FROM customer c
JOIN purchase p1 ON (c.id = p1.customer_id)
LEFT OUTER JOIN purchase p2 ON (c.id = p2.customer_id AND 
    (p1.date < p2.date OR (p1.date = p2.date AND p1.id < p2.id)))
WHERE p2.id IS NULL;

Explanation: given a row p1, there should be no row p2 with the same customer and a later date (or in the case of ties, a later id). When we find that to be true, then p1 is the most recent purchase for that customer.

Regarding indexes, I'd create a compound index in purchase over the columns (customer_id, date, id). That may allow the outer join to be done using a covering index. Be sure to test on your platform, because optimization is implementation-dependent. Use the features of your RDBMS to analyze the optimization plan. E.g. EXPLAIN on MySQL.

Some people use subqueries instead of the solution I show above, but I find my solution makes it easier to resolve ties.

edited Aug 25 '19 at 02:23

MarredCheese

17,541
8
92
91

answered Jan 21 '10 at 17:35

Bill Karwin

538,548
86
673
828

3

Favorably, in general. But that depends on the brand of database you use, and the quantity and distribution of data in your database. The only way to get a precise answer is for you to test both solutions against your data. – Bill Karwin Jan 21 '10 at 18:48
43

If you want to include customers who never made a purchase, then change JOIN purchase p1 ON (c.id = p1.customer_id) to LEFT JOIN purchase p1 ON (c.id = p1.customer_id) – GordonM Dec 08 '10 at 13:41
I've implemented this solution along with the LEFT JOIN change from @GordonM. My issue now, is what if i have 2 identical rows. Is there a way to limit this to just return 1 row (doesn't matter which one)? Great discussion btw. – russds Nov 07 '12 at 20:21
5

@russds, you need some unique column you can use to resolve the tie. It makes no sense to have two identical rows in a relational database. – Bill Karwin Nov 07 '12 at 23:26
Example with subquery is more preferrable for speed – Yaroslav May 20 '15 at 09:03
10

What is the purpose of "WHERE p2.id IS NULL"? – b.lyte Jun 10 '15 at 02:03
1

@clu: see explanation in my answer to http://stackoverflow.com/questions/121387/fetch-the-row-which-has-the-max-value-for-a-column/123481#123481 – Bill Karwin Jun 10 '15 at 22:17
@Yaroslav which example? and how do you know? – Matthew James Davis Sep 19 '15 at 10:45
8

this solution works only, if there are more than 1 purchase records. ist there is 1:1 link, it does NOT work. there it has to be "WHERE (p2.id IS NULL or p1.id=p2.id) – Bruno Jennrich Jul 23 '17 at 19:22
Why not do `p1.date <= p2.date` and get rid of the `or` in the outer join? I'm applying this to hive, where `or` isn't allowed in a join. The `or` seems redundant in any case. – Someguywhocodes May 09 '19 at 14:08
@Someguywhocodes, You should review operator precedence of boolean operations. `A OR B AND C` is not the same as `(A OR B) AND C`. – Bill Karwin May 09 '19 at 15:30
1

worked well for me and this has the fastest performance compared to other examples on this thread for the data set i have – Joel_J Aug 17 '19 at 19:28
4

@b.lit I believe the purpose of the "WHERE p2.id IS NULL" is to isolate the last record in the purchase table. When we reach the end of the table, p1 points to the last record and p2 points to the next record. The last record has no next record so the id of that record is null. – Patrick Dec 03 '20 at 21:35
1

@Patrick, You are correct, but FYI you are replying to a comment from 2015, so if that user needed to know the answer, they probably have gotten it by now. :-) – Bill Karwin Dec 03 '20 at 21:39
@BrunoJennrich I found out it only works if there is more then 1 record in the one to many relationship. I did the query and since many of mine only have 1 record it was missing a lot – carnini Jan 27 '22 at 22:23
I tried your solution and found it ve-e-ery slow for large tables with billions of rows. – gsa Apr 12 '22 at 20:42
@gsa Was any other solution not very slow for a table with billions of rows? – Bill Karwin Apr 12 '22 at 21:04
@BillKarwin Unfortunately, no. Thank you, your solution works well for relatively small tables. For example, it worked well for tables with thousands of rows. But it starts to work very slow for tables with hundreds of millions rows. I work with Postgresql and use indexes, of course. – gsa Apr 13 '22 at 11:22
1

@gsa Well one thing I've learned from a lot of performance consulting is that as the scale of your application gets larger, different solutions must be used. The guideline we used at Percona was that for every order of magnitude (10x) growth in traffic or data size, you need to reevaluate your architecture and decide if it's time to refactor it to some different implementation. There is hardly ever a "one size fits all" solution. – Bill Karwin Apr 13 '22 at 12:07
What about performance between subquery or your query ? – Zanyar Jalal Jun 15 '22 at 06:52
1

@ZanyarJ.Ahmed https://stackoverflow.com/questions/1313120/retrieving-the-last-record-in-each-group-mysql/1313293#1313293 – Bill Karwin Jun 15 '22 at 15:56

score 192 · Answer 2 · answered Jan 21 '10 at 17:40

192

You could also try doing this using a sub select

SELECT  c.*, p.*
FROM    customer c INNER JOIN
        (
            SELECT  customer_id,
                    MAX(date) MaxDate
            FROM    purchase
            GROUP BY customer_id
        ) MaxDates ON c.id = MaxDates.customer_id INNER JOIN
        purchase p ON   MaxDates.customer_id = p.customer_id
                    AND MaxDates.MaxDate = p.date

The select should join on all customers and their Last purchase date.

answered Jan 21 '10 at 17:40

Adriaan Stander

162,879
31
289
284

6

@clu: Change the `INNER JOIN` to a `LEFT OUTER JOIN`. – Sasha Chedygov Oct 05 '15 at 17:12
8

Looks like this assumes there is only one purchase on that day. If there were two you would get two output rows for one customer, I think? – artfulrobot Jun 12 '17 at 13:40
Why can't we do without the last `INNER JOIN` ? – Istiaque Ahmed Nov 09 '17 at 14:13
1

@IstiaqueAhmed - the last INNER JOIN takes that Max(date) value and ties it back to the source table. Without that join, the only information you would have from the `purchase` table are the date and the customer_id, but the query asks for all fields from the table. – Laughing Vergil Apr 22 '19 at 21:16
What if there are customers that don't have a purchase yet... does this remove them from the list because it is an inner join? – trilogy Jan 25 '23 at 20:52

Stefan Haberl · Answer 3 · 2013-06-12T08:21:00.230

56

Another approach would be to use a NOT EXISTS condition in your join condition to test for later purchases:

SELECT *
FROM customer c
LEFT JOIN purchase p ON (
       c.id = p.customer_id
   AND NOT EXISTS (
     SELECT 1 FROM purchase p1
     WHERE p1.customer_id = c.id
     AND p1.id > p.id
   )
)

edited Jun 12 '13 at 08:21

answered Jun 12 '13 at 08:10

Stefan Haberl

9,812
7
72
81

1

Can you explain the `AND NOT EXISTS` part in easy words ? – Istiaque Ahmed Nov 09 '17 at 14:24
1

The sub select just checks if there‘s a row with a higher id. You’ll only get a row in your result set, if none with higher id is found. That should be the unique highest one. – Stefan Haberl Nov 13 '17 at 00:30
2

When Id is a uniqueidentifier (guid), this cannot be used. – andrew.fox Jul 02 '20 at 11:46
1

@andrew.fox: Correct. You definitely need a consistent way to establish which record was inserted last. This will fail with the ID, if your IDs are random and not incremental. In that case replace the `>` comparison of ids with some timestamp column, but be sure to put an index on that column for performance reasons. – Stefan Haberl Jul 02 '20 at 14:32
1

@andrew.fox Note, that the OP stated: "If the (purchase) id is guaranteed to be sorted by date..." – Stefan Haberl Jul 02 '20 at 14:35

Tate Thurston · Answer 4 · 2018-06-21T21:40:21.383

If you're using PostgreSQL you can use DISTINCT ON to find the first row in a group.

SELECT customer.*, purchase.*
FROM customer
JOIN (
   SELECT DISTINCT ON (customer_id) *
   FROM purchase
   ORDER BY customer_id, date DESC
) purchase ON purchase.customer_id = customer.id

PostgreSQL Docs - Distinct On

Note that the DISTINCT ON field(s) -- here customer_id -- must match the left most field(s) in the ORDER BY clause.

Caveat: This is a nonstandard clause.

score 35 · Answer 5 · answered Jan 21 '10 at 18:23

35

You haven't specified the database. If it is one that allows analytical functions it may be faster to use this approach than the GROUP BY one(definitely faster in Oracle, most likely faster in the late SQL Server editions, don't know about others).

Syntax in SQL Server would be:

SELECT c.*, p.*
FROM customer c INNER JOIN 
     (SELECT RANK() OVER (PARTITION BY customer_id ORDER BY date DESC) r, *
             FROM purchase) p
ON (c.id = p.customer_id)
WHERE p.r = 1

answered Jan 21 '10 at 18:23

Madalina Dragomir

441
5
7

13

This is the wrong answer to the question because you are using "RANK()" instead of "ROW_NUMBER()". RANK will still give you the same problem of ties when two purchases have the exact same date. That's what the Ranking function does; if the top 2 match, they both get assigned the value of 1 and the 3rd record gets a value of 3. With Row_Number, there is no tie, it is unique for the entire partition. – MikeTeeVee Jan 15 '12 at 20:31
4

Trying Bill Karwin's approach against Madalina's approach here, with execution plans enabled under sql server 2008 I found Bill Karwin's apprach had a query cost of 43% as opposed to Madalina's approach which used 57%- so despite the more elegant syntax of this answer, I would still favour Bill's version! – Shawson Apr 12 '12 at 12:01
Yes, there must be ‘ROW_NUMBER()’ instead. However, nice work @Madalina – Adrian Covaci Aug 17 '23 at 12:15

score 34 · Answer 6 · answered Oct 23 '14 at 06:29

34

I found this thread as a solution to my problem.

But when I tried them the performance was low. Bellow is my suggestion for better performance.

With MaxDates as (
SELECT  customer_id,
                MAX(date) MaxDate
        FROM    purchase
        GROUP BY customer_id
)

SELECT  c.*, M.*
FROM    customer c INNER JOIN
        MaxDates as M ON c.id = M.customer_id

Hope this will be helpful.

answered Oct 23 '14 at 06:29

Mathee

691
7
16

1

to get only 1 i used `top 1` and `ordered it by` MaxDate `desc` – Roshna Omer Jan 01 '18 at 10:12
2

this is easy and straightforward solution, in MY case (many customers, few purchases) 10% faster then @Stefan Haberl's solution and more than 10 times better than accepted answer – Juraj Bezručka May 22 '18 at 10:27
1

Great suggestion using common table expressions (CTE) to solve this problem. This has dramatically improved the performance of queries in many situations. – AdamsTips Jul 19 '18 at 18:33
1

Best answer imo, easy to read, the MAX() clause gives great performance comparted to ORDER BY + LIMIT 1 – mrj Feb 28 '20 at 15:14
3

Wrong answer. It provides just latest date column from the table `purchase`. OP was asking for the whole record – Sergey Nudnov Jan 12 '21 at 16:25
Only provides the date not the whole record – carnini Jan 27 '22 at 22:35

score 13 · Answer 7 · edited May 23 '19 at 19:27

13

Try this, It will help.

I have used this in my project.

SELECT 
*
FROM
customer c
OUTER APPLY(SELECT top 1 * FROM purchase pi 
WHERE pi.customer_id = c.Id order by pi.Id desc) AS [LastPurchasePrice]

edited May 23 '19 at 19:27

Sarvar Nishonboyev

12,262
10
69
70

answered Jan 16 '18 at 05:07

Rahul Murari

439
1
5
16

Where do the alias "p" comes from? – TiagoA Apr 10 '19 at 15:46
1

this doesnt perform well.... took forever where other examples here took 2 seconds on the data set i have.... – Joel_J Aug 17 '19 at 19:24
This was the most performant option for my data set. – primo Sep 02 '21 at 05:21

celsowm · Answer 8 · 2021-04-08T23:33:42.557

On SQL Server you could use:

SELECT *
FROM customer c
INNER JOIN purchase p on c.id = p.customer_id
WHERE p.id = (
    SELECT TOP 1 p2.id
    FROM purchase p2
    WHERE p.customer_id = p2.customer_id
    ORDER BY date DESC
)

SQL Server Fiddle: http://sqlfiddle.com/#!18/262fd/2

On MySQL you could use:

SELECT c.name, date
FROM customer c
INNER JOIN purchase p on c.id = p.customer_id
WHERE p.id = (
    SELECT p2.id
    FROM purchase p2
    WHERE p.customer_id = p2.customer_id
    ORDER BY date DESC
    LIMIT 1
)

MySQL Fiddle: http://sqlfiddle.com/#!9/202613/7

score 6 · Answer 9 · answered Mar 23 '21 at 22:07

I needed what you needed, albeit many years later, and tried the two most popular answers. These did not yield the desired fruit. So this is what I have to offer... For clarity, I changed some names.

SELECT 
  cc.pk_ID AS pk_Customer_ID, 
  cc.Customer_Name AS Customer_Name, 
  IFNULL(pp.pk_ID, '') AS fk_Purchase_ID,
  IFNULL(pp.fk_Customer_ID, '') AS fk_Customer_ID,
  IFNULL(pp.fk_Item_ID, '') AS fk_Item_ID,
  IFNULL(pp.Purchase_Date, '') AS Purchase_Date
FROM customer cc
LEFT JOIN purchase pp ON (
  SELECT zz.pk_ID 
  FROM purchase zz 
  WHERE cc.pk_ID = zz.fk_Customer_ID 
  ORDER BY zz.Purchase_Date DESC LIMIT 1) = pp.pk_ID
ORDER BY cc.pk_ID;

I have a condition where I have to join many tables and there are at 2 where I used one to many relationship. this actually solved my problem — Mislam, Aug 05 '21 at 11:02

Mark · Answer 10 · 2018-01-27T04:39:28.380

5

Tested on SQLite:

SELECT c.*, p.*, max(p.date)
FROM customer c
LEFT OUTER JOIN purchase p
ON c.id = p.customer_id
GROUP BY c.id

The max() aggregate function will make sure that the latest purchase is selected from each group (but assumes that the date column is in a format whereby max() gives the latest - which is normally the case). If you want to handle purchases with the same date then you can use max(p.date, p.id).

In terms of indexes, I would use an index on purchase with (customer_id, date, [any other purchase columns you want to return in your select]).

The LEFT OUTER JOIN (as opposed to INNER JOIN) will make sure that customers that have never made a purchase are also included.

edited Jan 27 '18 at 04:39

answered Jan 27 '18 at 04:25

Mark

7,446
5
55
75

2

wont run in t-sql as the select c.* has columns not in the group by clause – Joel_J Aug 17 '19 at 19:30
2

I also find this works in SQLite. I scoured it documentation (which is extremely comprehensive) for some note saying that it should work but couldn't find anything. So there's no guarantee that it will work in future updates (unless you can find something I've missed). – Arthur Tacca Aug 05 '20 at 21:51

score 2 · Answer 11 · edited Jun 25 '16 at 09:50

2

Please try this,

SELECT 
c.Id,
c.name,
(SELECT pi.price FROM purchase pi WHERE pi.Id = MAX(p.Id)) AS [LastPurchasePrice]
FROM customer c INNER JOIN purchase p 
ON c.Id = p.customerId 
GROUP BY c.Id,c.name;

edited Jun 25 '16 at 09:50

Sandeep

1,504
7
22
32

answered Jun 25 '16 at 09:25

Milad Shahbazi

144
1
2
7

score 2 · Answer 12 · answered Aug 13 '21 at 16:47

Tables :

Customer => id, name
Purchase => id, customer_id, item_id, date

Query :

SELECT C.id, C.name, P.id, P.date
  FROM customer AS C
  LEFT JOIN purchase AS P ON 
    (
      P.customer_id = C.id 
      AND P.id IN (
        SELECT MAX(PP.id) FROM purchase AS PP GROUP BY PP.customer_id
      )
    )

You can also specify some condition into sub select query

Gary Bao 鲍昱彤 · Answer 13 · 2020-10-05T20:27:34.753

Without getting into the code first, the logic/algorithm goes below:

Go to the transaction table with multiple records for the same client.
Select records of clientID and the latestDate of client's activity using group by clientID and max(transactionDate)
```
   select clientID, max(transactionDate) as latestDate 
   from transaction 
   group by clientID
```

inner join the transaction table with the outcome from Step 2, then you will have the full records of the transaction table with only each client's latest record.

   select * from 
   transaction t 
   inner join (
     select clientID, max(transactionDate) as latestDate
     from transaction 
     group by clientID) d 
   on t.clientID = d.clientID and t.transactionDate = d.latestDate)

You can use the result from step 3 to join any table you want to get different results.

SQL join: selecting the last records in a one-to-many relationship

13 Answers13

Linked

Related