Finding the highest n values of each group in MySQL

Question

I have some data formatted like this:

Lane         Series
1            680
1            685
1            688
2            666
2            425
2            775
...

And I'd like to grab the highest n series per lane (let's say 2 for the sake of this example, but it could be many more than that)

So the output should be:

Lane         Series
1            688
1            685
2            775
2            666

Getting the highest series per lane is easy, but I can't seem to find a way to get the highest 2 results.

I use a MAX aggregate function with a GROUP BY to get the MAX, but there's no "TOP N" function as in SQL Server and using ORDER BY... LIMIT only returns the highest N results overall, not per lane.

Since I use a JAVA application I coded myself to query the database and choose what N is, I could do a loop and use a LIMIT and loop through every lane, making a different query each time, but I want to learn how to do it using MySQL.

This is very simple in SQL Server by using Partition/rank. Here is a similar question on how to achieve the same functionality using MySQL: http://stackoverflow.com/questions/3333665/mysql-rank-function — rkg, Jun 30 '11 at 22:54

Bohemian · Answer 1 · 2011-07-01T01:08:00.643

See my other answer for the MySQL-only, but very fast, solution.

This solution lets you specify any number of top rows per lane and doesn't use any MySQL "funky" syntax - it should run on most databases.

select lane, series
from lane_series ls
group by lane, series
having (
    select count(*) 
    from lane_series
    where lane = ls.lane
    and series > ls.series) < 2 -- Here's where you specify the number of top rows
order by lane, series desc;

Test output:

create table lane_series (lane int, series int);

insert into lane_series values 
(1, 680),
(1, 685),
(1, 688),
(2, 666),
(2, 425),
(2, 775);

select lane, series
from lane_series ls
group by lane, series
having (select count(*) from lane_series where lane = ls.lane and series > ls.series) < 2
order by lane, series desc;

+------+--------+
| lane | series |
+------+--------+
|    1 |    688 |
|    1 |    685 |
|    2 |    775 |
|    2 |    666 |
+------+--------+
4 rows in set (0.00 sec)

I tested your example on the sample table and it works exactly like it should. I have a problem, though, because it doesn't work on my actual table. I'm pretty sure it's because it is taking too long to return the data I'm looking for because right now, the table that contains the data has 50,000 rows and 8 columns (some of which have strings). Do you have an idea why it works with the sample table, but not with the actual table? — Jumbala, Jun 30 '11 at 23:51
@Adam: OK - see my other answer for a fast solution that will work with very large tables — Bohemian, Jul 01 '11 at 01:08

score 3 · Answer 2 · answered Jun 30 '11 at 23:14

3

This will work, if you know you'll never have ties for first place:

SELECT lane,MAX(series)
FROM scores
GROUP BY lane
UNION 
SELECT s.lane,MAX(s.series)
FROM scores AS s
JOIN (
    SELECT lane,MAX(series) AS series
    FROM scores
    GROUP BY lane
) AS x ON (x.lane = s.lane)
WHERE s.series <> x.series
GROUP BY s.lane;

answered Jun 30 '11 at 23:14

Jonathan Hall

75,165
16
143
189

This won't do because I sometimes have ties for first place. Thanks anyway! – Jumbala Jun 30 '11 at 23:52

Bohemian · Accepted Answer · 2011-07-01T01:48:04.693

This solution is the fastest for MySQL and will work with very large tables, but it uses "funky" MySQL features, so wouldn't be of use for other database flavours.

(Edited to sort before applying logic)

set @count:=-1, @lane:=0; 
select lane, series
from (select lane, series from lane_series order by lane, series desc) x
where if(lane != @lane, @count:=-1, 0) is not null
and if(lane != @lane, @lane:=lane, lane) is not null
and (@count:=@count+1) < 2; -- Specify the number of row at top of each group here

To put this query on steroids, define an index on lane and series: CREATE INDEX lane_series_idx on lane_series(lane, series); and it will do (super fast) index-only scan - so your other text columns don't affect it.

Good points of this query are:

It requires only one table pass (albeit sorted)
It handles ties at any level, for example if there's a tie for 2nd, only one of the 2nd will be displayed - ie the row count is absolute and never exceeded

Here's the test output:

create table lane_series (lane int, series int);

insert into lane_series values (1, 680),(1, 685),(1, 688),(2, 666),(2, 425),(2, 775);

-- Execute above query:

+------+--------+
| lane | series |
+------+--------+
|    1 |    688 |
|    1 |    685 |
|    2 |    775 |
|    2 |    666 |
+------+--------+

Great, this works, except that it returns the lowest two values (just look at the output in your example). I tested it with MySQL and it works flawlessly. I just have to adapt it to SQLite now for an offline version on another computer at work, which I thought had the same syntax as MySQL. Thanks to your example, it should be easy to adapt, thanks again! If you could just please change it to make it return the first two values instead of the last two, I'll mark your answer as accepted. — Jumbala, Jul 01 '11 at 01:36
@Adam: OK Fixed now. I needed to sort *before* the logic, so I used an aliases query. Cheers. — Bohemian, Jul 01 '11 at 01:48
Great! I'll have to study this to know how it actually works, but it works! Thanks again! — Jumbala, Jul 01 '11 at 01:52

score 0 · Answer 4 · answered Jul 01 '11 at 23:19

I think @Bohemian's generic answer can also be written as a join rather than a subquery, though it probably doesn't make much difference:

select ls1.lane, ls1.series
from lane_series ls1 left join lane_series ls2 on lane
where ls1.series < ls2.series
group by ls1.lane, ls1.series
having count(ls2.series) < 2 -- Here's where you specify the number of top rows
order by ls1.lane, ls1.series desc;

Finding the highest n values of each group in MySQL

4 Answers4

Linked