2

With over 1800 tagged questions for greatest-n-per-group and some excellent answers, I thought I'd find a solution to this one - but I've either missed the solution or I need a new approach.

I have a table photo_types to store votes by user who are voting (up or down) on what particular photo_type they think a given photo is. Photo types are 1-10 and each vote will be 1 or -1.

+----+-----+-----------+------------+------+
| id | user | photo_id | photo_type | vote |
+----+------+----------+------------+------+
|  1 | jane |   photo1 |          1 |    1 |
|  2 | jane |   photo2 |          2 |    1 |
|  3 | jane |   photo3 |          4 |   -1 |
|  4 |  ben |   photo1 |          1 |    1 |
|  5 |  ben |   photo2 |          3 |   -1 |
|  6 |  ben |   photo2 |          2 |    1 |
|  7 | mary |   photo1 |          1 |   -1 |
|  8 | mary |   photo3 |         10 |    1 |
|  9 | mary |   photo2 |          1 |    1 |
| 10 | mary |   photo1 |          2 |   -1 |
+----+------+----------+------------+------+

I need to left join this table back to a photos table (that has all the other details of a given photo) - but only include the top 2 voted types for each photo.

The photos table to which I need to LEFT JOIN the photo_types table looks like:

+----+----------+------------+----------------+---------------+------------+
| id | photo_id | photo_name |   photographer |      location |       date |
+----+----------+------------+----------------+---------------+------------+
|  1 |   photo1 | the bridge |    Bill Murray |  Brooklyn, NY | 2012-10-11 |
|  2 |   photo2 |    the cat | Jacques Chirac | Paris, France | 2013-01-03 |
|  3 |   photo3 |      a car |     the Grinch |    London, UK | 2016-09-01 |
+----+----------+------------+----------------+---------------+------------+

I'm obviously joining the two tables via photo_id.

To get the top voted types for each photo I have tried a sub-query like this:

SELECT photo_id, photo_type, sum(vote) AS votes
FROM photo_types
GROUP BY photo_type, photo_id
HAVING votes>0
ORDER BY votes DESC

which groups the sum of the votes by photo_type as well as photo_id.
This works fine but includes all types with sum(vote) > 0 - not just the top 2 voted types.
SQL Fiddle here

When included in the join it looks like:

SELECT * 
FROM photos
LEFT JOIN
    (SELECT photo_id, photo_type, sum(vote) AS votes
    FROM photo_types
    GROUP BY photo_type, photo_id
    HAVING votes>0
    ORDER BY votes DESC) AS pt
ON photos.photo_id = pt.photo_id
WHERE photos.date > '2010-01-01';

SQL Fiddle here

I had hoped to use Bill Karwin's solution but I'm having trouble joining the table to itself based on the grouped values (which is a SUM in my case). The sub-query I tried looked like :

SELECT pt1.*, SUM(pt1.vote) AS votes1, SUM(pt2.vote) AS votes2
FROM photo_types AS pt1
LEFT OUTER JOIN photo_types AS pt2
    ON pt1.photo_id = pt2.photo_id
        AND (votes1 < votes2
        OR (votes1 = votes2 AND pt1.id < pt2.id))
WHERE pt2.photo_id IS NULL

...which doesn't work because it's an attempt to join two tables on a calculated value (unlike Bill's solution).
SQL Fiddle here

Question
Is there a way to get the greatest-n-per-group when the grouping is based on a calculated values such as SUM(xxx) ?

Solutions which partially cover this are here and here but don't include aggregates in the grouped values.

The other obvious way to do it is to simply re-calculate the highest voted values each time a vote is placed and store this directly in the photos table - as discussed here - but unless it's impossible - I'd prefer to calculate within the SELECT for various reasons.

Community
  • 1
  • 1
goredwards
  • 2,486
  • 2
  • 30
  • 40

3 Answers3

1

If you have a limited list, the simplest way is the substring_index()/group_concat() trick:

SELECT photo_id,
       SUBSTRING_INDEX(GROUP_CONCAT(photo_type ORDER BY votes DESC), ',', 2) as top2
FROM (SELECT photo_id, photo_type, sum(vote) AS votes
      FROM photo_types
      GROUP BY photo_type, photo_id
      HAVING votes > 0
     ) pt
GROUP BY photo_id;

Notes:

  • The intermediate string for group_concat() is about 1k -- which is more than enough for this problem.
  • The alternatives (as you have discovered) either use variables for much more complex queries.
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • thanks - this does work - http://sqlfiddle.com/#!9/2029d8/11 - but then won't work with a std `WHERE photo_type IN (array)` if I need to select based on `photo_type` - and probably requires a `REGEXP`. Any way to avoid comma separated values in the results ? – goredwards Jan 05 '17 at 03:47
0

Look up xxx apply functions. They give you much more flexibility than just doing sub aggregate queries.

http://sqlserverplanet.com/sql-2005/cross-apply-explained
user3112728
  • 395
  • 1
  • 12
  • thanks, interesting article but unfortunately `CROSS APPLY` doesn't exist in MySQL - and not entirely sure how to apply the approximation as described here: http://stackoverflow.com/questions/36869221/cross-outer-apply-in-mysql – goredwards Jan 05 '17 at 05:16
0

OK so from this old blog post (mentioned a few times in other greatest-n-per-group solutions), the following works :

SELECT pt1.*
FROM 
  (SELECT id, photo_id, photo_type, sum(vote) AS votes
  FROM photo_types
  GROUP BY photo_type, photo_id
  HAVING votes>0) AS pt1
WHERE (
  SELECT COUNT(*) 
  FROM 
    (SELECT id, photo_id, photo_type, sum(vote) AS votes
    FROM photo_types
    GROUP BY photo_type, photo_id
    HAVING votes>0) AS pt2
  WHERE pt1.photo_id = pt2.photo_id and pt1.votes <= pt2.votes
) <=2
ORDER BY photo_id, votes DESC

see SqlFiddle here

However:
- not sure how efficient it is since it uses two subqueries
- won't return correct numbers of results if any of the greatest-n have identical values (since that pushes the count outside the specified limit) - as you can see in this SqlFiddle

goredwards
  • 2,486
  • 2
  • 30
  • 40