Syntactic vs logical order of operations
I think that the confusion around the relationship between DISTINCT
and ORDER BY
(or also GROUP BY
, for that matter), can only really be understood if the logical order of operations in SQL is understood. It is different from the syntactic order of operations, which is the primary source of confusion.
In this example, it looks as though DISTINCT
is related to SELECT
, given its syntactic closeness, but it's really an operator that is applied after SELECT
(the projection). Due to the nature of what DISTINCT
does (remove duplicate rows), all the not-projected contents of a row are no longer available after the DISTINCT
operation, which includes the ORDER BY
clause. According to the logical order of operations (simplified):
FROM
(produces all possible column references)
WHERE
(can use all column references from FROM
)
SELECT
(can use all column references from FROM
, and create new expressions, and alias them)
DISTINCT
(operates on the tuple projected by SELECT
)
ORDER BY
(depending on the presence of DISTINCT
, can operate on the tuple projected by SELECT
, and if DISTINCT
is absent *perhaps (depending on the dialect) also on other expressions)
What about DISTINCT
and ORDER BY
The fact that, without DISTINCT
, ORDER BY
can access (in some dialects) also things that haven't been projected may be a bit weird, certainly useful. E.g. this works:
WITH emp (id, fname, name) AS (
VALUES (1, 'A', 'A'),
(2, 'C', 'A'),
(3, 'B', 'B')
)
SELECT id
FROM emp
ORDER BY fname DESC
dbfiddle here. Producing
id
--
2
3
1
This changes when you add DISTINCT
. This no longer works:
WITH emp (id, fname, name) AS (
VALUES (1, 'A', 'A'),
(2, 'C', 'A'),
(3, 'B', 'B')
)
SELECT DISTINCT name
FROM emp
ORDER BY fname DESC
dbfiddle here. The error being:
ERROR: for SELECT DISTINCT, ORDER BY expressions must appear in select list
LINE 8: ORDER BY fname DESC
Because what fname
value would you attribute to name = A
? A
or C
? The answer would decide whether you'd be getting A
, B
as a result or B
, A
. It cannot be decided.
PostgreSQL DISTINCT ON
Now, as mentioned in the above linked article, PostgreSQL supports an exception to this, which can occasionally be useful: DISTINCT ON
(see also questions like these):
WITH emp (id, fname, name) AS (
VALUES (1, 'A', 'A'),
(2, 'C', 'A'),
(3, 'B', 'B')
)
SELECT DISTINCT ON (name) id, fname, name
FROM emp
ORDER BY name, fname, id
dbfiddle here, producing:
id |fname|name
---|-----|----
1 |A |A
3 |B |B
This query allows to produce only distinct values of name
, and then per duplicate row, take the first one given the ORDER BY
clause, which makes the choice per distinct group unambiguous. This can be emulated in other RDBMS using window functions.