39

I have a database of items. Each item is categorized with a category ID from a category table. I am trying to create a page that lists every category, and underneath each category I want to show the 4 newest items in that category.

For Example:

Pet Supplies

img1
img2
img3
img4

Pet Food

img1
img2
img3
img4

I know that I could easily solve this problem by querying the database for each category like so:

SELECT id FROM category

Then iterating over that data and querying the database for each category to grab the newest items:

SELECT image FROM item where category_id = :category_id 
ORDER BY date_listed DESC LIMIT 4

What I'm trying to figure out is if I can just use 1 query and grab all of that data. I have 33 categories so I thought perhaps it would help reduce the number of calls to the database.

Anyone know if this is possible? Or if 33 calls isn't that big a deal and I should just do it the easy way.

gung - Reinstate Monica
  • 11,583
  • 7
  • 60
  • 79
justinl
  • 10,448
  • 21
  • 70
  • 88
  • How "static" are your categories? Is it a list that changes every now and then or is it constant? – David Andres Sep 18 '09 at 04:09
  • the categories are very static (rarely will change). They won't ever really change unless I add a category which I don't think will happen or will be very rare – justinl Sep 18 '09 at 04:11
  • @justinl: if they're static, you're best off with a simple UNION statement. See my answer for an example. – David Andres Sep 18 '09 at 04:14
  • @justinl suggested title for question: "MySql, A JOIN B: how to limit to N rows from B, for each PK from A ?" – mjv Sep 18 '09 at 04:48
  • You can use windowing functionality explained here https://stackoverflow.com/a/38854846/2723942 – Paramvir Singh Karwal Feb 01 '22 at 14:50

8 Answers8

97

This is the greatest-n-per-group problem, and it's a very common SQL question.

Here's how I solve it with outer joins:

SELECT i1.*
FROM item i1
LEFT OUTER JOIN item i2
  ON (i1.category_id = i2.category_id AND i1.item_id < i2.item_id)
GROUP BY i1.item_id
HAVING COUNT(*) < 4
ORDER BY category_id, date_listed;

I'm assuming the primary key of the item table is item_id, and that it's a monotonically increasing pseudokey. That is, a greater value in item_id corresponds to a newer row in item.

Here's how it works: for each item, there are some number of other items that are newer. For example, there are three items newer than the fourth newest item. There are zero items newer than the very newest item. So we want to compare each item (i1) to the set of items (i2) that are newer and have the same category as i1. If the number of those newer items is less than four, i1 is one of those we include. Otherwise, don't include it.

The beauty of this solution is that it works no matter how many categories you have, and continues working if you change the categories. It also works even if the number of items in some categories is fewer than four.


Another solution that works but relies on the MySQL user-variables feature:

SELECT *
FROM (
    SELECT i.*, @r := IF(@g = category_id, @r+1, 1) AS rownum, @g := category_id
    FROM (@g:=null, @r:=0) AS _init
    CROSS JOIN item i
    ORDER BY i.category_id, i.date_listed
) AS t
WHERE t.rownum <= 3;

MySQL 8.0.3 introduced support for SQL standard window functions. Now we can solve this sort of problem the way other RDBMS do:

WITH numbered_item AS (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY category_id ORDER BY item_id) AS rownum
  FROM item
)
SELECT * FROM numbered_item WHERE rownum <= 4;
Bill Karwin
  • 538,548
  • 86
  • 673
  • 828
  • Hi Bill. Thanks that works great! Something I didn't mention in the question, is that I only want to return results that have an image, if the image is blank, then I don't want it to be returned. How can I put a WHERE clause in there so that it doesn't return an image if the item.image = '' – justinl Sep 18 '09 at 07:06
  • Ah I just figured it out. I put a "WHERE image <> '' " above the GROUP BY line – justinl Sep 18 '09 at 07:10
  • 1
    FYI: If you want to constrain against other table columns you have to do so in the ON brackets, and using a WHERE just above the GROUP BY eg: ON (i2.active = TRUE) WHERE i1.active = TRUE – justinl Sep 29 '09 at 05:17
  • If you want this to scale down to top 1 item, then the clause on the item ids needs to be `i1.item_id <= i2.item_id`. – porges Mar 11 '10 at 03:30
  • `LEFT OUTER JOIN item i2 USING (category_id, item_id)` – tereško Jun 02 '12 at 10:53
  • Bill- we applied this solution to one of [my problems](http://stackoverflow.com/a/12114175/165673), but it failed to return n in the case of a tie among the top values. See [demo](http://sqlfiddle.com/#!2/b6ce1/3) – Yarin Aug 24 '12 at 18:46
  • @Yarin, yes, when you have ties, you have to have some way of breaking ties in the JOIN condition. Otherwise use a solution involving the row number per group, like the solution given by bluefeet in your other post. MySQL doesn't support it, but other databases support SQL *windowing functions* so you can get the row number per partition in a result set. See http://www.postgresql.org/docs/9.1/static/tutorial-window.html – Bill Karwin Aug 24 '12 at 19:11
  • Should the JOIN condition be "ON (i1.category_id = i2.category_id AND i1.date_listed < i2.listed) " ? – xiaolong Aug 19 '13 at 22:26
  • @pdxhiker, I'm assuming two things: item_id has the same sort order as date_listed, and item_id is unique. But if either of those assumptions is not correct, then you're right, it would be better to use the condition as you describe. And you'd need some extra term in the join expression to account for ties. – Bill Karwin Aug 19 '13 at 22:44
  • Great answer! Thanks! Regarding the first query, if I wanted to generalize it for N, rather than 4, I should replace the HAVING line with HAVING COUNT(i2.item_id) < N, right? I mean, to make it more general is better to replace the star by a columns of the righter table. Am I right? – drake Jul 08 '15 at 06:19
  • @drake, no, you can use COUNT(*). To generalize it, use a different value besides 4. – Bill Karwin Jul 08 '15 at 06:26
  • Suppose N=1, because it is a left outer join, the HAVING COUNT(*) < 1 condition is impossible to meet since there will always be a row with non-null values for the left table columns and null values for the right table columns. What am I missing? Why doesn't the HAVING COUNT line count that row? – drake Jul 08 '15 at 06:33
  • 1
    @drake, you're right about that. But for finding the top 1 per group, there is another query style that's even more efficient, because it can do the task without using GROUP BY at all. See for example my answer in http://stackoverflow.com/questions/121387/fetch-the-row-which-has-the-max-value-for-a-column – Bill Karwin Jul 08 '15 at 14:55
  • Thanks for replying and the link! Does COUNT(*) have any efficiency advantage over COUNT(specific column)? – drake Jul 08 '15 at 15:39
  • 1
    @drake, in my experience, any difference is very slight. You can benchmark it yourself to be sure. In general, you should use COUNT(column) for the logical reason - when you want the count to skip rows where the column is NULL. Whereas COUNT(*) counts all rows, whether the column is null or not. – Bill Karwin Jul 08 '15 at 22:41
  • Version 8 ! What happened to 6 & 7? Can't wait for window functions, writing reporting and analytic queries without them is not fun. I look at things like that `cross join + where clause` and that triangular join using `>` and I know the query optimizer won't be able to do anything about the memory blowout. Although perhaps the window functions can't be optimized beyond the query plan of that triangular join anyway so it might be just sugar. – Davos Jan 15 '18 at 14:59
  • 1
    @Davos: https://dev.mysql.com/doc/refman/8.0/en/faqs-general.html#faq-mysql-why-8.0 – Bill Karwin Jan 15 '18 at 16:43
  • Great answer i assume *"I'm assuming the primary key of the item table is item_id, and that it's a monotonically increasing pseudokey. That is, a greater value in item_id corresponds to a newer row in item."* is the hard way of trying to tell that `item_id` ideally should have a `AUTO_INCREMENT` option?? – Raymond Nijland May 14 '19 at 15:22
  • 1
    @RaymondNijland, Yes, MySQL's AUTO_INCREMENT is a monotonically increasing pseudokey. Other SQL implementations use terms like SEQUENCE, IDENTITY, etc. – Bill Karwin May 14 '19 at 15:25
  • i know mine comment was more meant to be a rhetorical question actually, i wanted more of less make a statement that it ideally shouldn't be a application generated pseudokey.. sorry for the confusion – Raymond Nijland May 14 '19 at 15:30
5

This solution is an adaptation from another SO solution, thank you RageZ for locating this related/similar question.

NOTE

This solution seems satisfactory for Justin's use case. Depending on your use case you may want to check Bill Karwin or David Andres' solutions in this posting. Bill's solution has my vote! See why, as I put both queries next to one another ;-)

The benefit of my solution is that it returns one record per category_id (the info from the item table is "rolled-up"). The main drawback of my solution is its lack of readability and its growing complexity as the number of desired rows grows (say to have 6 rows per category rather than 6). Also it may be slightly slower as the number of rows in the item table grows. (Regardless, all solutions will perform better with a smaller number of eligible rows in the item table, and it is therefore advisable to either periodically delete or move older items and/or to introduce a flag to help SQL filter out rows early)

First try (didn't work!!!)...

The problem with this approach was that the subquery would [rightfully but bad for us] produce very many rows, based on the cartesian products defined by the self joins...

SELECT id, CategoryName(?), tblFourImages.*
FROM category
JOIN (
    SELECT i1.category_id, i1.image as Image1, i2.image AS Image2, i3.image AS Image3, i4.image AS Image4
    FROM item AS i1
    LEFT JOIN item AS i2 ON i1.category_id = i2.category_id AND i1.date_listed > i2.date_listed
    LEFT JOIN item AS i3 ON i2.category_id = i3.category_id AND i2.date_listed > i3.date_listed
    LEFT JOIN item AS i4 ON i3.category_id = i4.category_id AND i3.date_listed > i4.date_listed
) AS tblFourImages ON tblFourImages.category_id = category.id
--WHERE  here_some_addtional l criteria if needed
ORDER BY id ASC;

Second try. (works ok!)

A WHERE clause in added for the subquery, forcing the date listed to be the latest, second latest, thrird lateest etc. for i1, i2, i3 etc. respectively (and also allowing for the null cases when there are fewer than 4 items for a given category id). Also added was unrelated filter clauses to prevent showing entries that are "sold" or entries that do not have an image (added requirements)

This logic makes the assumption that there are no duplicate date listed values (for a given category_id). Such cases would otherwise create duplicate rows. Effectively this use of the date listed is that of a monotonically incremented primary key as defined/required in Bill's solution.

SELECT id, CategoryName, tblFourImages.*
FROM category
JOIN (
    SELECT i1.category_id, i1.image as Image1, i2.image AS Image2, i3.image AS Image3, i4.image AS Image4, i4.date_listed
    FROM item AS i1
    LEFT JOIN item AS i2 ON i1.category_id = i2.category_id AND i1.date_listed > i2.date_listed AND i2.sold = FALSE AND i2.image IS NOT NULL
          AND i1.sold = FALSE AND i1.image IS NOT NULL
    LEFT JOIN item AS i3 ON i2.category_id = i3.category_id AND i2.date_listed > i3.date_listed AND i3.sold = FALSE AND i3.image IS NOT NULL
    LEFT JOIN item AS i4 ON i3.category_id = i4.category_id AND i3.date_listed > i4.date_listed AND i4.sold = FALSE AND i4.image IS NOT NULL
    WHERE NOT EXISTS (SELECT * FROM item WHERE category_id = i1.category_id AND date_listed > i1.date_listed)
      AND (i2.image IS NULL OR (NOT EXISTS (SELECT * FROM item WHERE category_id = i1.category_id AND date_listed > i2.date_listed AND date_listed <> i1.date_listed)))
      AND (i3.image IS NULL OR (NOT EXISTS (SELECT * FROM item WHERE category_id = i1.category_id AND date_listed > i3.date_listed AND date_listed <> i1.date_listed AND date_listed <> i2.date_listed)))
      AND (i4.image IS NULL OR (NOT EXISTS (SELECT * FROM item WHERE category_id = i1.category_id AND date_listed > i4.date_listed AND date_listed <> i1.date_listed AND date_listed <> i2.date_listed AND date_listed <> i3.date_listed)))
) AS tblFourImages ON tblFourImages.category_id = category.id
--WHERE  --
ORDER BY id ASC;

Now... compare the following where I introduce an item_id key and use Bill's solution to provide the list of these to the "outside" query. You can see why Bill's approach is better...

SELECT id, CategoryName, image, date_listed, item_id
FROM item I
LEFT OUTER JOIN category C ON C.id = I.category_id
WHERE I.item_id IN 
(
SELECT i1.item_id
FROM item i1
LEFT OUTER JOIN item i2
  ON (i1.category_id = i2.category_id AND i1.item_id < i2.item_id
      AND i1.sold = 'N' AND i2.sold = 'N'
      AND i1.image <> '' AND i2.image <> ''
      )
GROUP BY i1.item_id
HAVING COUNT(*) < 4
)
ORDER BY category_id, item_id DESC
Community
  • 1
  • 1
mjv
  • 73,152
  • 14
  • 113
  • 156
  • Now I get: #1054 - Unknown column 'date_listed' in 'order clause' If I remove the date_listed from the ORDER clause it does work, but it seems to not iterate over the different categories, but instead just lists out the same category over and over again – justinl Sep 18 '09 at 05:46
  • Okay i got date_listed figured out (I just added it to the JOIN's subquery like we did with the category_id). But each row of the returned result is showing the same categoryName, ID, and image path – justinl Sep 18 '09 at 05:50
  • haha it's so close. but the rows that are returned are all from the same category (even though I have half a dozen items in different categories). – justinl Sep 18 '09 at 06:20
  • Actually, I feel bad, I got you on this track, but there's a flaw with the design. Basically the subquery produces [rightfully but bad for us] a whole slew of rows from the cartesian product expressed by the self joins. Another side issue, which we can address once this issue is solved, is that as written now, there could not be any two record in image table with same date _and_ same category_id... – mjv Sep 18 '09 at 06:54
  • No worries about my time. It's a bit like a challenge, plus a nice ego check, when "simple" stuff like that ends up blowing up in my face... I'll give it another 30 minutes... – mjv Sep 18 '09 at 07:05
  • Ok, Justin. I got it. a few more minutes... One question: can you confirm that for a given category_id, the date_listed for all items will differ (by at least a second) ? – mjv Sep 18 '09 at 07:30
  • I can't confirm it exactly because the items are user generated. but the date_listed is logged by seconds, so it won't happen too often. And if it does, I don't mind losing that item and it just shows the next one – justinl Sep 18 '09 at 07:35
  • I should also note that there are 2 other parameters that I will be checking for on each item. First is that each item has a bool sold column. I don't want to show an item if it's sold. Also, I don't want to have items returned that have no images. The WHERE clause would be something like WHERE sold = 0 AND image <> '' – justinl Sep 18 '09 at 07:39
  • Ok, Justin, I added the sold and image constraint. I added these at the level of each join, for symmetry/consistency, except for i1 table which is in the where clause. ATTENTION, being on mySQL, the NOT NULL test may need to be a bit different (but easy to change) – mjv Sep 18 '09 at 07:50
  • @justin, with your ok, I'll delete some of these comments (mine), not so much to erase evidence of this long trek, but to clean things up. I think we should somehow qualify this solution as workable, with friendly output etc., but otherwise honor Bill's simpler solution. – mjv Sep 18 '09 at 07:53
  • Btw it sounds like your not using mySQL to test this on your end. On my end I get the following error: #1064 - You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'NOT EXISTS (SELECT * FROM item WHERE category_id = i1.category_id AND date' at line 10 – justinl Sep 18 '09 at 08:03
  • OK, the not showing any item if an item had "sold" is fixed. It is updated in the response code. Basically the i1.sold=true and i1.image is not null clause was moved from the where to the join condition with i2. – mjv Sep 18 '09 at 08:46
  • @justin Thank you for your kind acceptance. Rewarding effort not brains, in this case ;-) I edited my posting as so to nudge future readers away from my not so elegant solution. Unless one wishes to use the roll-up logic and is sure to ever need only the top 3 or 4 items for each group, there is not point to self impose such abuse. 't was fun none the less, I hope I didn't waste too much of _your_ time. – mjv Sep 18 '09 at 13:57
  • @justin Pls do kill some of your comments here, for clean up / clarity purposes. – mjv Sep 18 '09 at 13:59
3

In other databases you can do this using the ROW_NUMBER function.

SELECT
    category_id, image, date_listed
FROM
(
    SELECT
        category_id, image, date_listed,
        ROW_NUMBER() OVER (PARTITION BY category_id
                           ORDER BY date_listed DESC) AS rn
    FROM item
) AS T1
WHERE rn <= 4

Unfortunately MySQL does not support the ROW_NUMBER function, but you can emulate it using variables:

SELECT
    category_id, image, date_listed
FROM
(
    SELECT
        category_id, image, date_listed,
        @rn := IF(@prev = category_id, @rn + 1, 1) AS rn,
        @prev := category_id
    FROM item
    JOIN (SELECT @prev := NULL, @rn = 0) AS vars
    ORDER BY category_id, date_listed DESC
) AS T1
WHERE rn <= 4

See it working online: sqlfiddle

It works as follows:

  • Intially @prev is set to NULL, and @rn is set to 0.
  • For each row we see, check if the category_id is the same as the previous row.
    • If yes, increment the row number.
    • Otherwise start a new category and reset the row number back to 1.
  • When the subquery completes, the final step is to filter so that only rows with row number less than or equal to 4 are kept.
Mark Byers
  • 811,555
  • 193
  • 1,581
  • 1,452
  • Fortunately MySQL 8.0 will support [windowed functions](https://dev.mysql.com/doc/refman/8.0/en/window-functions-usage.html) – Lukasz Szozda Nov 30 '17 at 20:01
0

Depending on how constant your categories are, the following is the simplest route

SELECT C.CategoryName, R.Image, R.date_listed
FROM
(
    SELECT CategoryId, Image, date_listed
    FROM 
    (
      SELECT CategoryId, Image, date_listed
      FROM item
      WHERE Category = 'Pet Supplies'
      ORDER BY date_listed DESC LIMIT 4
    ) T

    UNION ALL

    SELECT CategoryId, Image, date_listed
    FROM
    (        
      SELECT CategoryId, Image, date_listed
      FROM item
      WHERE Category = 'Pet Food'
      ORDER BY date_listed DESC LIMIT 4
    ) T
) RecentItemImages R
INNER JOIN Categories C ON C.CategoryId = R.CategoryId
ORDER BY C.CategoryName, R.Image, R.date_listed
David Andres
  • 31,351
  • 7
  • 46
  • 36
  • Thanks David. So is this way of combining all the queries into 1 big query more efficient than doing 33 separate queries (1 for each category)? – justinl Sep 18 '09 at 04:20
  • Yes, it can be, if only for the fact that you're probably doing your 33 separate queries as separate requests from the database. Some of that time is spent simply shuttling data back and forth to/from the database server. I've also modified the UNION to a UNION ALL, which does not check for and remove duplicates. You probably wouldn't have any in any case. – David Andres Sep 18 '09 at 04:23
  • Thanks. You are correct that I won't have any duplicates because all the items have a PK. Also it seems like I could just build a query by querying all the category ID's and then building a query by iterating over those results and combining them into a string and using that string as the new query. – justinl Sep 18 '09 at 04:26
  • If that's what you want to do. I say why bother, particularly if you're telling me that category changes don't happen often. If that's the case, copy and paste. When categories change you can come back to this query and make the appropriate modifications. It won't be automatic, but it will work. – David Andres Sep 18 '09 at 04:28
  • I just realized I don't understand in your query how to JOIN the categories. E.g. How do those SELECT statements know what Category is? Because the category ID and name is in another table. – justinl Sep 18 '09 at 04:28
  • Your OP has the category id and image in the item table, so that's what I worked off of. – David Andres Sep 18 '09 at 04:47
  • Thanks David. I don't understand this line: WHERE C.CategoryId = :category_id. What does :category_id refer to? I used that in my example as a placeholder for the category_id. But I thought this query would span across all categories, and thus, wouldn't need a WHERE statement – justinl Sep 18 '09 at 05:06
  • I'm also receiving the error #1221 - Incorrect usage of UNION and ORDER BY – justinl Sep 18 '09 at 05:09
  • Sorry, ORDER BY statements should only appear after UNION/UNION ALL (my mistake). Due to this, I had to structure the query I little differently. I had the WHERE :category_id line to show how you could filter this if you wanted. – David Andres Sep 18 '09 at 12:33
0

the code below shows a way to do it in a loop it definetely needs a lot of editing, but i hope it helps.

        declare @RowId int
 declare @CategoryId int
        declare @CategoryName varchar(MAX)

 create table PART (RowId int, CategoryId int, CategoryName varchar)
 create table  NEWESTFOUR(RowId int, CategoryId int, CategoryName varchar, Image image)
        select RowId = ROW_NUMBER(),CategoryId,CategoryName into PART from [Category Table]


        set @PartId = 0
 set @CategoryId = 0 
 while @Part_Id <= --count
 begin
   set @PartId = @PartId + 1
          SELECT @CategoryId = category_id, @CategoryName = category_name from PART where PartId = @Part_Id
          SELECT RowId = @PartId, image,CategoryId = @category_id, CategoryName = @category_name   FROM item into NEWESTFOUR where category_id = :category_id 
ORDER BY date_listed DESC LIMIT 4

 end
 select * from NEWESTFOUR
 drop table NEWESTFOUR
        drop table PART
Paul Maxwell
  • 33,002
  • 3
  • 32
  • 51
0

Recently I came across a similar situation, I tried a query that worked for me which is independent on database

SELECT i.* FROM Item AS i JOIN Category c ON i.category_id=c.id WHERE
(SELECT count(*) FROM Item i1 WHERE 
i1.category_id=i.category_id AND 
i1.date_listed>=i.date_listed) <=3 
ORDER BY category_id,date_listed DESC;

It is equivalent to running 2 for loops and checking if items newer than this are less than 3

rakesh
  • 4,368
  • 1
  • 19
  • 13
-1

not very pretty but:

SELECT image 
FROM item 
WHERE date_listed IN (SELECT date_listed 
                      FROM item 
                      ORDER BY date_listed DESC LIMIT 4)
tster
  • 17,883
  • 5
  • 53
  • 72
-2

ok after a googling the quick answer would it's not possible at least on mysql

this this thread for reference

maybe you should cache the result of that query if you are afraid to make fall down the server and you want the code to perform more well

Community
  • 1
  • 1
RageZ
  • 26,800
  • 12
  • 67
  • 76