Using aggregate function without selecting the aggregate column and without subquery

Question

My data looks like this:

id  name       score 
--------------------
a   apple        0.2
a   apple        0.7
a   apple        1.1
a   banana       1.2
b   cherry       0.8
b   lemon        0.9
c   mango        2.4
c   raspberry    1.9
d   strawberry   0.7
d   lemon        1.1

For each id, I want to select the row with the highest score, but only the id and name:

id  name
----------
a   banana
b   lemon
c   mango
d   lemon

The following query does the job with sqlite. (It is explained in this answer, why this query is actually invalid in most DBMSs):

SELECT id, name from (SELECT id, name, max(score) from data group by id);

The question is: Is this possible without a subquery?

Notes: I am using sqlite at the moment, but I am looking for a portable solution. If there are only vendor-specific solutions, it is also a valid answer. This question is similar, but the necessity of subqueries is not discussed.

However, there is a solution for postgresql using `DISTINCT ON`: https://stackoverflow.com/questions/3800551/select-first-row-in-each-group-by-group/7630564#7630564 — Radim Bača, Feb 21 '18 at 09:04
What's the expected result of two different fruits for an id have the same highest score? — jarlh, Feb 21 '18 at 09:13
The example query in the question is essentially just a `SELECT DISTINCT id, name FROM data` Why do you need the subquery for it? Or if you are interested in the name for the maximum in the group, that query is not reliable and it is just a lucky coincidence that you got the actual values. — Pred, Feb 21 '18 at 09:14
`SELECT id, name, max(score) from data group by id` is invalid SQL and most definitely **not** portable. — , Feb 21 '18 at 09:16
@a_horse_with_no_name Please see the notes. I am aware that SQL dialects vary across DBMS prodcuts. But a lot of queries will work across many products. I want to know If there is one for this particular use case. — code_onkel, Feb 21 '18 at 09:17
@jarlh Having the same score is unlikely in the real case and if it occurs, it does not matter which row is selected. — code_onkel, Feb 21 '18 at 09:20
@Pred The real data originates from a join already, this is why the IDs are not unique. So it is not equivalent to the select you suggested (this is just a coincidence in the example, it will edit it). — code_onkel, Feb 21 '18 at 09:22
@code_onkel: It is still equivalend, or as I said, it is just a coincidence that you got the correct name, because without an explicitly specified ordering, the order will be determined by the way the dbms accesses the data, therefore most rdbms will either prohibit columns in the select list which are not aggregated nor listed in the group by (that query is not portable because of this). When the rdbms allows orphan columns in a grouped query the retrieved data is either not determined or there are strict rules when it is allowed and can be considered as deterministic. — Pred, Feb 21 '18 at 09:29
To find vendor-specific answers simply look at the various solutions to questions tagged with [tag:greatest-n-per-group] — , Feb 21 '18 at 09:29

score 2 · Accepted Answer · 2018-02-21T09:28:17.480

A portable solution implies standard SQL. In standard SQL this is typically solved using window functions.

select id, name
from (
   select id, name, dense_rank() over (partition by id order by score desc) as rnk
   from the_table
) t 
where rnk = 1;

The above is standard SQL and works on essentially every modern DBMS (even MariaDB and the upcoming MySQL 8.0). However, I don't think SQLite supports window functions.

You original sub-query:

SELECT id, name, max(score) 
from data 
group by id

is invalid standard SQL because the name column is neither part of the GROUP BY nor used in an aggregate function. The query will be rejected by essentially every other DBMS - including newer versions of MySQL where ONLY_FULL_GROUP_BY is turned on by default. Apparently SQLite allows this invalid grouping resulting in non-deterministic (=random) results.

The only exception to that rule is if the the grouping all non-grouping columns have a known functional dependency to the grouping column. That means if the grouping column is a primary key and all non-grouped columns belong to the table of that primary key. To my knowledge only Postgres currently supports this.

Could you please elaborate why the query in my question is invalid? — code_onkel, Feb 21 '18 at 09:21

Using aggregate function without selecting the aggregate column and without subquery

1 Answers1