
How can one get the records with the highest/smallest values per group?

The former title of this question was "using rank (@Rank := @Rank + 1) in complex query with subqueries - will it work?", because I was looking for a solution using ranks, but now I see that the solution posted by Bill is much better.

Original question:

I'm trying to compose a query that takes the last record of each group, given some defined order:

SET @Rank=0;

select s.*
from (select GroupId, max(Rank) AS MaxRank
      from (select GroupId, @Rank := @Rank + 1 AS Rank 
            from Table
            order by OrderField
            ) as t
      group by GroupId) as t 
  join (
      select *, @Rank := @Rank + 1 AS Rank
      from Table
      order by OrderField
      ) as s 
  on t.GroupId = s.GroupId and t.MaxRank = s.Rank
order by OrderField

The expression @Rank := @Rank + 1 is the usual way to compute a rank, but it looks suspicious to me here: it is used in two subqueries but initialized only once. Will it work this way?

And second, will it work with one subquery that is evaluated multiple times, like a subquery in a WHERE (or HAVING) clause? This is another way to write the above:

SET @Rank=0;

select Table.*, @Rank := @Rank + 1 AS Rank
from Table
having Rank = (select max(Rank) AS MaxRank
              from (select GroupId, @Rank := @Rank + 1 AS Rank 
                    from Table as t0
                    order by OrderField
                    ) as t
              where t.GroupId = Table.GroupId
             )
order by OrderField

Thanks in advance!

starball
Tomas
  • more advanced question here: http://stackoverflow.com/questions/9841093/how-to-writegreatest-n-per-group-type-query-but-with-additional-conditions/9845109#9845109 – Tomas Mar 25 '12 at 10:56
  • Does this answer your question? [Fetch the row which has the Max value for a column](https://stackoverflow.com/questions/121387/fetch-the-row-which-has-the-max-value-for-a-column) – rogerdpack May 06 '21 at 17:19
  • [Why the order of evaluation for expressions involving user variables is undefined?](https://stackoverflow.com/a/44751302/3404097) – philipxy Feb 10 '23 at 07:17

2 Answers


To get the row with the highest OrderField per group:

SELECT t1.*
FROM `Table` AS t1
LEFT OUTER JOIN `Table` AS t2
  ON t1.GroupId = t2.GroupId AND t1.OrderField < t2.OrderField
WHERE t2.GroupId IS NULL
ORDER BY t1.OrderField; -- optional; only needed if you want sorted output

If there are more records with the same OrderField within the same group and you need exactly one of them, you may want to extend the condition:

SELECT t1.*
FROM `Table` AS t1
LEFT OUTER JOIN `Table` AS t2
  ON t1.GroupId = t2.GroupId 
        AND (t1.OrderField < t2.OrderField 
         OR (t1.OrderField = t2.OrderField AND t1.Id < t2.Id))
WHERE t2.GroupId IS NULL

In other words, return each row t1 for which no other row t2 exists with the same GroupId and a greater OrderField. When t2.GroupId is NULL, the left outer join found no such match, and therefore t1 has the greatest value of OrderField in its group.
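This pattern is easy to sanity-check on any SQL engine; here is a minimal sketch using SQLite via Python (the table name t and the sample rows are illustrative, mirroring the dummy data shown later in this answer):

```python
import sqlite3

# In-memory table with two groups of three rows each.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (GroupId INTEGER, OrderField INTEGER, foo TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(10, 10, None), (10, 20, None), (10, 30, "foo"),
     (20, 40, None), (20, 50, None), (20, 60, "foo")],
)

# The anti-join: keep a row only if no row in the same group has a
# greater OrderField.
rows = conn.execute("""
    SELECT t1.*
    FROM t AS t1
    LEFT OUTER JOIN t AS t2
      ON t1.GroupId = t2.GroupId AND t1.OrderField < t2.OrderField
    WHERE t2.GroupId IS NULL
    ORDER BY t1.OrderField
""").fetchall()

print(rows)  # -> [(10, 30, 'foo'), (20, 60, 'foo')]
```

Each group contributes exactly one row, the one with the maximum OrderField.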

No ranks, no subqueries. This should run fast and optimize access to t2 with "Using index" if you have a compound index on (GroupId, OrderField).
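For example, such a compound index could be created like this (the index name is arbitrary):

```sql
-- Covering index so the self-join probe on t2 can be resolved from the index alone:
ALTER TABLE `Table` ADD INDEX idx_group_order (GroupId, OrderField);
```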


Regarding performance, see my answer to Retrieving the last record in each group. I tried a subquery method and the join method using the Stack Overflow data dump. The difference is remarkable: the join method ran 278 times faster in my test.

It's important that you have the right index to get the best results!

Regarding your method using the @Rank variable, it won't work as you've written it, because the values of @Rank won't reset to zero after the query has processed the first table. I'll show you an example.

I inserted some dummy data, with an extra field that is null except on the row we know is the greatest per group:

select * from `Table`;

+---------+------------+------+
| GroupId | OrderField | foo  |
+---------+------------+------+
|      10 |         10 | NULL |
|      10 |         20 | NULL |
|      10 |         30 | foo  |
|      20 |         40 | NULL |
|      20 |         50 | NULL |
|      20 |         60 | foo  |
+---------+------------+------+

We can show that the rank increases to three for the first group and six for the second group, and the inner query returns these correctly:

select GroupId, max(Rank) AS MaxRank
from (
  select GroupId, @Rank := @Rank + 1 AS Rank
  from `Table`
  order by OrderField) as t
group by GroupId

+---------+---------+
| GroupId | MaxRank |
+---------+---------+
|      10 |       3 |
|      20 |       6 |
+---------+---------+

Now run the query with no join condition, to force a Cartesian product of all rows, and we also fetch all columns:

select s.*, t.*
from (select GroupId, max(Rank) AS MaxRank
      from (select GroupId, @Rank := @Rank + 1 AS Rank 
            from `Table`
            order by OrderField
            ) as t
      group by GroupId) as t 
  join (
      select *, @Rank := @Rank + 1 AS Rank
      from `Table`
      order by OrderField
      ) as s 
  -- on t.GroupId = s.GroupId and t.MaxRank = s.Rank
order by OrderField;

+---------+---------+---------+------------+------+------+
| GroupId | MaxRank | GroupId | OrderField | foo  | Rank |
+---------+---------+---------+------------+------+------+
|      10 |       3 |      10 |         10 | NULL |    7 |
|      20 |       6 |      10 |         10 | NULL |    7 |
|      10 |       3 |      10 |         20 | NULL |    8 |
|      20 |       6 |      10 |         20 | NULL |    8 |
|      20 |       6 |      10 |         30 | foo  |    9 |
|      10 |       3 |      10 |         30 | foo  |    9 |
|      10 |       3 |      20 |         40 | NULL |   10 |
|      20 |       6 |      20 |         40 | NULL |   10 |
|      10 |       3 |      20 |         50 | NULL |   11 |
|      20 |       6 |      20 |         50 | NULL |   11 |
|      20 |       6 |      20 |         60 | foo  |   12 |
|      10 |       3 |      20 |         60 | foo  |   12 |
+---------+---------+---------+------------+------+------+

We can see from the above that the max rank per group is correct, but @Rank then keeps increasing as the second derived table is processed, from 7 upward. So the ranks from the second derived table never overlap with the ranks from the first derived table at all.

You'd have to add another derived table to force @Rank to reset to zero in between processing the two tables (and hope the optimizer doesn't change the order in which it evaluates tables, or else use STRAIGHT_JOIN to prevent that):

select s.*
from (select GroupId, max(Rank) AS MaxRank
      from (select GroupId, @Rank := @Rank + 1 AS Rank 
            from `Table`
            order by OrderField
            ) as t
      group by GroupId) as t 
  join (select @Rank := 0) r -- RESET @Rank TO ZERO HERE
  join (
      select *, @Rank := @Rank + 1 AS Rank
      from `Table`
      order by OrderField
      ) as s 
  on t.GroupId = s.GroupId and t.MaxRank = s.Rank
order by OrderField;

+---------+------------+------+------+
| GroupId | OrderField | foo  | Rank |
+---------+------------+------+------+
|      10 |         30 | foo  |    3 |
|      20 |         60 | foo  |    6 |
+---------+------------+------+------+

But the optimization of this query is terrible. It can't use any indexes, it creates two temporary tables, sorts them the hard way, and even uses a join buffer because it can't use an index when joining temp tables either. This is example output from EXPLAIN:

+----+-------------+------------+--------+---------------+------+---------+------+------+---------------------------------+
| id | select_type | table      | type   | possible_keys | key  | key_len | ref  | rows | Extra                           |
+----+-------------+------------+--------+---------------+------+---------+------+------+---------------------------------+
|  1 | PRIMARY     | <derived4> | system | NULL          | NULL | NULL    | NULL |    1 | Using temporary; Using filesort |
|  1 | PRIMARY     | <derived2> | ALL    | NULL          | NULL | NULL    | NULL |    2 |                                 |
|  1 | PRIMARY     | <derived5> | ALL    | NULL          | NULL | NULL    | NULL |    6 | Using where; Using join buffer  |
|  5 | DERIVED     | Table      | ALL    | NULL          | NULL | NULL    | NULL |    6 | Using filesort                  |
|  4 | DERIVED     | NULL       | NULL   | NULL          | NULL | NULL    | NULL | NULL | No tables used                  |
|  2 | DERIVED     | <derived3> | ALL    | NULL          | NULL | NULL    | NULL |    6 | Using temporary; Using filesort |
|  3 | DERIVED     | Table      | ALL    | NULL          | NULL | NULL    | NULL |    6 | Using filesort                  |
+----+-------------+------------+--------+---------------+------+---------+------+------+---------------------------------+

Whereas my solution using the left outer join optimizes much better. It uses no temp table and even reports "Using index" which means it can resolve the join using only the index, without touching the data.

+----+-------------+-------+------+---------------+---------+---------+-----------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key     | key_len | ref             | rows | Extra                    |
+----+-------------+-------+------+---------------+---------+---------+-----------------+------+--------------------------+
|  1 | SIMPLE      | t1    | ALL  | NULL          | NULL    | NULL    | NULL            |    6 | Using filesort           |
|  1 | SIMPLE      | t2    | ref  | GroupId       | GroupId | 5       | test.t1.GroupId |    1 | Using where; Using index |
+----+-------------+-------+------+---------------+---------+---------+-----------------+------+--------------------------+

You'll probably read people making claims on their blogs that "joins make SQL slow," but that's nonsense. Poor optimization makes SQL slow.

philipxy
Bill Karwin
  • This may prove quite useful (for the OP too), but, sadly, answers neither of the two questions asked. – Andriy M Jan 05 '12 at 21:20
  • Thanks Bill, that's a good idea how to avoid the ranks, but ... wouldn't the join be slow? The join (without the where clause limitation) would be of much larger size than in my queries. Anyway, thanks for the idea! But I would also be interested in the original question, i.e. whether the ranks would work this way. – Tomas Jan 05 '12 at 23:53
  • From my test, if there are two order fields, this answer's direction leads to very poor performance. Subquery with max() approach give much better performance, like this: `select * from main_table join (select GroupId, OrderField1, max(OrderField2) as OrderField2 from main_table join (select GroupId, max(OrderField1) as OrderField1 from main_table group by GroupId) as sub1 using (GroupId, OrderField1) group by GroupId) as sub2 using (GroupId, OrderField1, OrderField2) group by GroupId`. For reference, not sure if just my case. Index involved: main_table(GroupId, OrderField2) – Johnny Wong Jan 04 '17 at 11:14
  • The first solution in this set has terrible performance -- O(N*N). – Rick James May 17 '17 at 01:33
  • @RickJames, check the EXPLAIN I provided. It's a table-scan, but the join is a `ref`. So it's O(N*logN). – Bill Karwin May 17 '17 at 03:34
  • @BillKarwin - This may be version-dependent. I looked at the `SESSION STATUS LIKE 'Handler%'` values and got 30M for my 5K test table. – Rick James May 17 '17 at 03:48
  • @RickJames, or it may depend on how many distinct groups there are. If there are many groups with few rows per group, it could be a better choice. But few groups with many rows would have to scan a lot of rows. – Bill Karwin May 17 '17 at 06:25
  • @BillKarwin - I need to ponder that. I was using 5K+ Canadian cities in 13 provinces ("group") based on population ("order"). – Rick James May 17 '17 at 15:22
  • How can I do it with filtering? I mean, I want to know items for each group with highest column1 only among rows filtered by created date, for example. Can I do it without subquery or temporary table as well? – Wood Nov 08 '22 at 04:45
  • @Sam, Yes you can, but you have to apply the filtering to the table in both correlations. – Bill Karwin Nov 08 '22 at 05:30
  • Beware of self-joins. They are unsuitable if the groups are large, because for each group, they combine each row in that group with each other row in that group. This is quadratic in the group size. A group of 1000 rows will explode to a million. This (possibly absurd) set of rows is _only filtered out at the end_, after it has already been worked on. This is easily observed against two large data sets of the same total size, one with many results (small groups), the other with few results (large groups). The query with larger groups will be far, far slower, despite returning fewer rows. – Timo Aug 14 '23 at 17:10
  • @Timo - What you say is true only if the join does not use an index. The EXPLAIN I showed for my self-join solution shows that it is using an index for the join. The `rows` column for the joined table `t2` shows that the optimizer estimates it will examine 1 row, not all 6. – Bill Karwin Aug 14 '23 at 18:57
  • @BillKarwin I hope that you are right and MySQL 8 has new optimizations for this. I'm not sure how reliable the estimation is. What estimation do you get with 5.7? All my tests in the past have been quadratic in the group size, measured empirically. Yes, we're talking exclusively about indexed joins. – Timo Aug 14 '23 at 21:39

If you need more sophisticated logic than just the first result per group, such as getting only the 2nd result, or the first result satisfying a certain constraint, use a window function (MySQL 8.0+):

select *
from (
    select
        GroupId, ColumnName, OrderField,
        ROW_NUMBER() OVER (PARTITION BY GroupId ORDER BY OrderField DESC) AS row_number
    from `Table`
    where ColumnName != 'BadValue'
    ) a
where a.row_number = 1
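A quick way to try this pattern outside MySQL is SQLite (3.25+ also supports ROW_NUMBER()). This sketch reuses the dummy data from the accepted answer and changes the filter to row_number = 2 to pick the second-highest row per group (the table name t and the data are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (GroupId INTEGER, OrderField INTEGER, foo TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(10, 10, None), (10, 20, None), (10, 30, "foo"),
     (20, 40, None), (20, 50, None), (20, 60, "foo")],
)

# Number the rows within each group from highest to lowest OrderField,
# then keep only the row numbered 2 in each group.
rows = conn.execute("""
    SELECT GroupId, OrderField
    FROM (SELECT GroupId, OrderField,
                 ROW_NUMBER() OVER (PARTITION BY GroupId
                                    ORDER BY OrderField DESC) AS rn
          FROM t) AS ranked
    WHERE rn = 2
    ORDER BY GroupId
""").fetchall()

print(rows)  # -> [(10, 20), (20, 50)]
```

Changing the WHERE filter (rn = 1, rn <= 3, ...) selects the first, top three, and so on, without restructuring the query.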
philipxy
Rafi