1

I was taught and heard that in sql/mysql, items in select clause must appear in the GROUP BY clause or be used in an aggregate function as in here

However, the example below may have changed my mind.

Two tables: Student (sid is the key)

sid  | name | email
========================
99901| mike | mike@a.edu
99902| jane | jane@b.edu
99903| peter| pete@b.edu

Took (sid+oid together is the key, oid stands for offering id)

sid  | oid| grade
==================
99901| 1  | 100
99901| 2  | 30
99901| 3  | 40
99902| 4  | 100
99902| 5  | 100
99902| 6  | 40
99903| 6  | 95

Question: I want to find the sid, name and average grade of each student who has taken at least 2 courses.

Answer:

select s.sid, name, avg(grade) as average
from Student as s, Took as t
where s.sid = t.sid
group by s.sid
having count(*) >= 2;

Result:

sid  | name | avgerage
=======================
99901| mike | 56.6667
99902| jane | 80.0000

Based on must appear in the GROUP BY clause or be used in an aggregate function, the query should have been incorrect because name is neither in group clause nor an aggregate function.

I looked some posts and this, my understanding is that although name is neither in group clause nor an aggregate function, we group by sid which is the key and each sid only correspond to one name, so it won't return multiple options from which sql doesn't know which one to return. To confirm my understanding, if I select one more column email, it's still ok; but if I select oid, it gives error because each sid corresponds to more than one oid.

Could someone correct my understanding if it is wrong or elaborate more on this statement: must appear in the GROUP BY clause or be used in an aggregate function

Thanks.

First Edit:

Btw, I tested in MySQL 8.0.17

Second Edit:

Just a summary of useful links when you read the answers/comments below.

Functional depedency

SQL standard change

Kenny
  • 95
  • 1
  • 7
  • `mysql` will make inferences for you when it comes to this, while all other major dbms will rigidly enforce the rule. I'd recommend simply following the rule even in `mysql` to save yourself any confusion, but you seem to have a perfect understanding of it. – Aaron Dietz Oct 03 '19 at 16:00
  • @AaronDietz Don't know about other vendors, but in MySQL there can be performance issues when non-indexed columns are included in the GROUP BY clause. – Paul Spiegel Oct 03 '19 at 16:01
  • 1
    See: [Detection of Functional Dependence](https://dev.mysql.com/doc/refman/8.0/en/group-by-functional-dependence.html) – Paul Spiegel Oct 03 '19 at 16:04
  • The [ONLY_FULL_GROUP_BY](https://dev.mysql.com/doc/refman/8.0/en/sql-mode.html#sqlmode_only_full_group_by) controls this, btw. (It's on by default in mysql 8...) [More reading](https://dev.mysql.com/doc/refman/8.0/en/group-by-handling.html) – Shawn Oct 03 '19 at 16:09
  • This blogpost (old but still relevant till today) can clear all misconceptions and myths surrounding `FULL_GROUP_BY` and SQL standard: http://rpbouman.blogspot.com/2007/05/debunking-group-by-myths.html – Madhur Bhaiya Oct 03 '19 at 16:51
  • All non-aggregated columns must be in `GROUP BY`. – Eric Oct 03 '19 at 17:04
  • *"All non-aggregated columns must be in GROUP BY"* Yes before ANSI SQL standard 1999 @Eric after ANSI SQL standard 1999 has Functional Dependency (which also exists for 20 years now) which allows non-aggregated columns to be used in the SELECT it is really a thing something which MySQL and PostgreSQL supports.. – Raymond Nijland Oct 03 '19 at 17:12

2 Answers2

2

First, you should use proper, explicit JOIN syntax:

select s.sid, s.name, avg(grade) as average
from Student s join
     Took t
     on s.sid = t.sid
group by s.sid
having count(*) >= 2;

This will work because of something called functional dependencies. Basically, this is the part of the standard that says: If you group by a primary key or unique key, then you can include any of the columns from that table.

Here is documentation on the subject.

That is, because the database knows that s.sid is unique, it is safe to use other columns. This is part of the standard. The only other common database that I am aware of that supports this is Postgres.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Thanks, it's very helpful. One thing I don't understand is why the explicit `JOIN` is better than just `cartisian product`, in this example, they perform the same functionality, maybe because of the speed issue? – Kenny Oct 03 '19 at 17:21
  • 1
    Readablity as it is directly clear this is inner joining, in the MySQL optimisation those explicit JOIN would become comma join again @Kenny As comma join only supports CROSS JOIN/INNER JOIN result, where explicit JOIN in MySQL supports CROSS/LEFT/RIGHT/INNER so the query can be changed more simply if you need other results .. – Raymond Nijland Oct 03 '19 at 17:32
  • @RaymondNijland Thanks. I also just found it [here](https://stackoverflow.com/questions/20138355/whats-the-difference-between-comma-separated-joins-and-join-on-syntax-in-mysql) – Kenny Oct 03 '19 at 17:48
1

You were taught right.

According to the SQL Standard when you use GROUP BY the columns that can appear in the SELECT clause fall into three categories:

  1. Columns included in the GROUP BY clause. In this case you have s.sid.
  2. Aggregated columns. In this case you have avg(grade).
  3. Functionally dependent columns of case #1. Since s.sid is the PK of the table, you can include s.name without aggregating it.

So all good.

However, you should know that MySQL 5.7.4 and older do allow you to include other columns in non-aggregated form. This is a bug/feature of MySQL that I personally find error prone. If you do this, MySQL will silently pick one value randomly without aggregating it and without telling you.

This functionality can be turned on by using the ONLY_FULL_GROUP_BY configuration parameter (as @Shawn pointed out in the comments) in newer versions of MySQL, to allow old/bad queries to run. I would try to avoid using it, though.

The Impaler
  • 45,731
  • 9
  • 39
  • 76
  • MySQL 5.7.5+ with running sql_mode ONLY_FULL_GROUP_BY does correctly implement ANSI/ISO SQL 1999 standard regrading of detection of functional dependence rules as far i know... – Raymond Nijland Oct 03 '19 at 16:50