Why does SQL not have a straightforward way of querying using aggregate functions?

Question

It is an incredibly common scenario where I want to select the row (or rows) that have either the maximum or minimum value for some column - often a datetime stamp of some kind. It would seem logical that a simple way to do this would be something like this:

SELECT *
FROM MyTable
WHERE DateColumn = MAX(DateColumn)

This, of course, is not allowed. Aggregate functions are not allowed in a WHERE clause (though I don't know why, exactly). One could use a HAVING clause, but this doesn't actually work either:

SELECT *
FROM MyTable
HAVING DateColumn = MAX(DateColumn)

Instead, the only solutions seem to be some variation of a subquery, something like this:

SELECT *
FROM MyTable
WHERE DateColumn = (
  SELECT MAX(DateColumn)
  FROM MyTable
)

Why is such a common need made so complicated? The intent of both of my examples above seems quite obvious, so why can't the SQL compiler be made to understand them? Or, if there is some technical reason why the existing implementation of WHERE cannot handle this syntax, why has no simple syntax been added to the language? I run into this particular need very frequently, and I see from searching online that I am hardly the only one. It would seem like the language should have accounted for this LONG ago, but it never has. Is there some serious technical or logical limitation I am missing here that makes this unrealistic?

I don't have time to answer in full, but in short it's because doing that doesn't really translate to a set operation and requires two passes over the data, one to find the aggregate and the second to filter. — Eterm, Aug 02 '22 at 15:56
You can also use a CTE or Derived table with a windowed `MAX`; a subquery isn't the only solution. — Thom A, Aug 02 '22 at 16:00
Though tagged with [[tag:oracle]], this is still relevant: [Why are aggregate functions not allowed in where clause](https://stackoverflow.com/questions/42470849/why-are-aggregate-functions-not-allowed-in-where-clause) — Thom A, Aug 02 '22 at 16:05
Why not just `SELECT TOP(1)` ordered by whatever you want the max/min of? — HoneyBadger, Aug 02 '22 at 16:05
Some RDBMSs have proprietary syntax to (potentially) simplify this (Teradata's qualify) — Andrew, Aug 02 '22 at 17:03
You need to think of the `WHERE` as operating per row. It's the equivalent in say C# of doing `list.Where(x => x.DateColumn == Max(x.DateColumn)` which makes no sense. You need to first make a pass to get the max, then another pass to filter. Obviously in this particular instance, you can also just do a `TOP (1) WITH TIES` — Charlieface, Aug 03 '22 at 13:17
@Charlieface Doesn't seem like you read my working example with the subquery. You also seem to assume I want only one row. The equivalent Link statement would be: list.Where(x => x.DateColumn == list.Max(y => y.DateColumn)), which makes perfect sense. Of course, this is essentially a subquery again. Ultimately, the underlying implementation in SQL would likely be something similar to a subquery, but the utility of such simplified syntax seems like it would be obvious. — Sean Worle, Aug 03 '22 at 16:05
@HoneyBadger "Why not just SELECT TOP(1) ordered by whatever you want the max/min of?" How would you know beforehand that there is only one row with the max value? My example will pull multiple rows, if multiple rows exist that meet the condition. — Sean Worle, Aug 03 '22 at 16:07
`TOP (1) WITH TIES` will return multiple rows if there are more than one which tie for the max. Perhaps it would be useful, but it wouldn't be part of the `WHERE`. In databases that support this, it's in a separate `QUALIFY` clause which comes after `WHERE` and `GROUP BY` and `HAVING` but before `ORDER BY` and `OFFSET` — Charlieface, Aug 03 '22 at 16:08
@Larnu "You can also use a CTE or Derived table..." Certainly true. All essentially the same concept - one has to construct multiple queries to achieve the result, which seems like something the language could easily have handled in a way that allows for simplified syntax. — Sean Worle, Aug 03 '22 at 16:09
The language isn't at fault here, @SeanWorle, it's your understanding (of the language) that is. — Thom A, Aug 03 '22 at 16:12
I agree that `QUALIFY` would be useful, you can upvote the suggestion https://feedback.azure.com/d365community/idea/3dcd2bb1-6125-ec11-b6e6-000d3a4f0da0 — Charlieface, Aug 03 '22 at 16:15

marcothesane · Answer 1 · 2022-08-02T17:22:51.113

-1

It's SQL. It's set oriented. And it's only ordered if you deliberately state that - and how - you want your data ordered.

And there is, in every DBMS I know, a means to limit the number of rows returned.

In SQL Server, your query would need to be:

SELECT TOP(1)
  *
FROM mytable
ORDER BY datecolumn DESC;

And just because we can: let's assume you want the newest row for each ID. There, decent DBMS-s have OLAP functions like ROWNUM() OVER() to help you with that task.

Input would be a table with a bunch of identifiers and several different dates per identifier. And you want the newest for each identifier. See here: ...

WITH
indata(id,dt) AS (
            SELECT 1, DATE '2022-01-01'
  UNION ALL SELECT 1, DATE '2022-01-02'
  UNION ALL SELECT 2, DATE '2022-01-01'
  UNION ALL SELECT 2, DATE '2022-01-02'
  UNION ALL SELECT 3, DATE '2022-01-01'
  UNION ALL SELECT 3, DATE '2022-01-02'
  UNION ALL SELECT 4, DATE '2022-01-01'
  UNION ALL SELECT 4, DATE '2022-01-02'
)
,
w_rownum AS (
  SELECT
    ROW_NUMBER() OVER(PARTITION BY id ORDER BY dt DESC) AS rn
  , * 
  FROM indata
)
SELECT
  id  
, dt
FROM w_rownum
WHERE rn=1
ORDER BY id; 
-- out  id |     dt     
-- out ----+------------
-- out   1 | 2022-01-02
-- out   2 | 2022-01-02
-- out   3 | 2022-01-02
-- out   4 | 2022-01-02

edited Aug 02 '22 at 17:22

answered Aug 02 '22 at 16:49

marcothesane

6,192
1
11
21

I would say it's "bag oriented". Even though the term "set" is widely used in the SQL Standard, rows processed and returned can have duplicates. – The Impaler Aug 02 '22 at 16:53
1

You must assume that datecolumn is unique, otherwise TOP 1 simply returns the first (of potentially many) row that has the maximum value. Logically a very different query. – SMor Aug 02 '22 at 17:03
Absolutely. All queries in the question also assume the same .... – marcothesane Aug 02 '22 at 17:13
2

@SMor you can always use `WITH TIES` – HoneyBadger Aug 02 '22 at 17:17
1

@TheImpaler The technical term is *multi-set* – Charlieface Aug 02 '22 at 23:59
@Charlieface You are absolutely right. SQL-92 in section 3.1.p – The Impaler Aug 03 '22 at 13:07
Of course SQL is set oriented - I WANT a set. This solution doesn't actually do what I would be expecting. If you use Top(1), then you only get 1 row - whichever row you specified in the sorting. Take a look at my example working solution which uses a subquery - it will return the set of all rows where the date column is the maximum value for that column. This COULD be a single row, but doesn't have to be. I want to know why this common need isn't supported in SQL more directly, without having to build awkward subqueries to handle it. – Sean Worle Aug 03 '22 at 15:56
@marcothesane "Absolutely. All queries in the question also assume the same" No, they sure don't. My example working query will pull a set if there are more than 1 row matching the max value. This answer does NOT actually address the question I asked. – Sean Worle Aug 03 '22 at 15:58
Look at my second answer. Yes, it does need a subquery - if your DBMS does not support the QUALIFY clause - but it's more efficient than grouping in any case. You might also want to read Kimball's publications on Slowly Changing Dimensions: There, he maintains a `valid_from`/ `valid_to` date pair, plus an `is_current` indicator Boolean. `valid_to` is usually equal to the next `valid_from` of the same identifier, and equal to `9999-12-31` for the newest row. And the newest row also is the only one where `is_current` is true. – marcothesane Aug 03 '22 at 16:04

Why does SQL not have a straightforward way of querying using aggregate functions?

1 Answers1