You should be able to understand the problem on a simplified case that doesn't even involve a JOIN.
The query SELECT x,[other columns] GROUP BY x
expresses the fact that for every distinct value of x
, the [other columns] must be output with only one row for every x
.
Now looking at a simplified example where the student
table has two entries:
stud_id=1, class_id=1
stud_id=1, class_id=2
And we ask for SELECT stud_id,class_id FROM student GROUP BY class_id
.
There is only one distinct value of stud_id
, which is 1.
So we're telling the SQL engine, give me one row with stud_id=1
and the value of class_id
that comes with it. And the problem is that there is not one, but two such values, 1 and 2. So which one to choose? Instead of choosing randomly, the SQL engine yields an error saying the question is conceptually bogus in the first place, because there's no rule that says each distinct value of stud_id
has its own corresponding class_id
.
On the other hand, if the non-GROUP'ed output columns are aggregate functions that transform a series of values into just one, like min
, max
, or count
, then they provide the missing rules that say how to get only one value from several. That's why the SQL engine is OK with, for instance: SELECT stud_id,count(class_id) FROM student GROUP BY stud_id;
.
Also, when faced with the error column "somecolumn" must appear in the GROUP BY clause, you don't want to just add columns to the GROUP BY
until the error goes away, as if it was purely a syntax problem. It's a semantic problem, and each column added to the GROUP BY changes the sense of the question submitted to the SQL engine.
That is, GROUP BY x,y
means for each distinct value of the (x,y) couple. It does not mean GROUP BY x
, and hey, since it leads to an error, let's throw in the y
as well!