Let's decompose two queries against two tables, both containing two columns. First, we'll do a simple one:
SELECT t1.a,t2.d + 6 as e
FROM
table1 t1
inner join
table2 t2
on
t1.a = t2.c
WHERE
t1.b = 2
ORDER BY
t2.c
And lets consider what is "in scope" as we complete each clause:
FROM table1 t1
- at this point, we have a result set containing two columns - {t1.a, t1.b}
.
INNER JOIN table2 t2 ON ...
- we now have a result set containing four columns - T1.a, t1.b, t2.c, t2.d}
. We may personally also now that a
and c
are equal but that's irrelevant for the analysis.
WHERE
- although WHERE
can filter rows from a query, it doesn't change the set of columns making up the result set - it's still {t1.a, t1.b, t2.c, t2.d}
.
SELECT
- we don't have a GROUP BY
clause, and so the job of the SELECT
clause here is a) to mark some columns for output and b) possibly to add some additional columns whose values are computed. That's what we have here. We end up with a set of {O(t1.a), t1.b, t2.c, t2.d, O(e = t2.d +6)}
1.
ORDER BY
- now we order by t2.c
, which is still in scope despite the fact that it won't be output
finally, the outputs of this query are delivered (technically via a cursor) and just contains {a, e}
. The columns no longer have their "originating table" associated with them, and the non-output columns disappear into the ether.
SELECT
t1.a,SUM(t2.d) as e
FROM
table1 t1
inner join
table2 t2
on
t1.a = t2.c
GROUP BY t1.a
HAVING e > 5
ORDER BY t1.a
The FROM
/JOIN
clauses are identical to previously and so the same analysis prevails. Similarly we have no WHERE
clause but it's irrelevant to the set of columns. We have {t1.a, t1.b, t2.c, t2.d}
.
SELECT
/GROUP BY
/DISTINCT
. DISTINCT
and GROUP BY
are really the same thing - both identify a set of columns either explicitly (GROUP BY
) or by their existing in the SELECT
clause. You cannot untie SELECT
from GROUP BY
because we also have to compute aggregates and the aggregate definitions are in the SELECT
clause. For each distinct set of values evident in the grouping columns, we produce a single output row containing that set of values together with any computed aggregates. We produce here {O(t1.a), O(e)}
2 and that is the result set that the remaining parts of the query can observe. The original result set is not in scope.
HAVING
- we can work with just those columns produced by the SELECT
clause3. But again, we filter rows, not columns.
and ORDER BY
can also only work with the columns produced by the SELECT
.
By the time SELECT
was done, we only had output columns anyway but the output processing is the same anyway.
Hopefully, from the above you can see that SELECT
can work in two quite different ways; but at least now you're aware of the difference and what the knock-on effects of that are.
1I'm making up terminology on the fly here, but I'm using the O()
wrapper to mean "this column will be in the final result set".
2This is the behaviour you appear to have been expecting SELECT
to always exhibit, only providing the "outputable" rows to later clauses.
3mysql contains an extension to the SQL standard that allows non-grouped and non-aggregated columns to appear as HAVING
clause predicates. They're effectively re-written to be used in the WHERE
clause instead.