14

I am used to Microsoft technologies including SQL Server. Today I ran across a Q&A where the following passage from the MySQL documentation was quoted:

Standard SQL would reject your query because you can not SELECT non-aggregate fields that are not part of the GROUP BY clause in an aggregate query. MySQL extends the use of GROUP BY so that the select list can refer to nonaggregated columns not named in the GROUP BY clause. This means that the preceding query is legal in MySQL. You can use this feature to get better performance by avoiding unnecessary column sorting and grouping. However, this is useful primarily when all values in each nonaggregated column not named in the GROUP BY are the same for each group. The server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate.

Is MySQL breaking the standard by allowing this? How? What is the result of allowing this?

Braiam
  • 1
  • 11
  • 47
  • 78
rahularyansharma
  • 11,156
  • 18
  • 79
  • 135

4 Answers4

28

Standard SQL would reject your query because you can not SELECT non-aggregate fields that are not part of the GROUP BY clause in an aggregate query

This is correct, up to 1992.

But it is plainly wrong, from 2003 and beyond.

From SQL-2003 standard, 6IWD6-02-Foundation-2011-01.pdf, from http://www.wiscorp.com/, paragraph-7.12 (query specification), page 398:

  1. If T is a grouped table, then let G be the set of grouping columns of T. In each ((value expression)) contained in ((select list)) , each column reference that references a column of T shall reference some column C that is functionally dependent on G or shall be contained in an aggregated argument of a ((set function specification)) whose aggregation query is QS

Now MYSQL, has implemented this feature by allowing not only columns that are functionally dependent on the grouping columns but allowing all columns. This is causing some problems with users that do not understand how grouping works and get indeterminate results where they don't expect.

But you are right to say that MySQL has added a feature that conflicts with SQL-standards (although you seem to think that for the wrong reason). It's not entirely accurate as they have added a SQL-standard feature but not in the best way (more like the easy way) but it does conflict with the latest standards.

To answer your question, the reason for this MySQL feature (extension) is I suppose to be accordance with latest SQL-standards (2003+). Why they chose to implement it this way (not fully compliant), we can only speculate.

As @Quassnoi and @Johan answered with examples, it's mainly a performance and maintainability issue. But one can't easily change the RDBMS to be clever enough (Skynet excluded) to recognize functionally dependent columns, so MySQL developers made a choice:

We (MySQL) give you (MySQL users) this feature which is in SQL-2003 standards. It improves speed in certain GROUP BY queries but there's a catch. You have to be careful (and not the SQL engine) so columns in the SELECT and HAVING lists are functionally dependent on the GROUP BY columns. If not, you may get indeterminate results.

If you want to disable it, you can set sql_mode to ONLY_FULL_GROUP_BY.

It's all in the MySQL docs: Extensions to GROUP BY (5.5) - although not in the above wording but as in your quote (they even forgot to mention that it's a deviation from standard SQL-2003 while not standard SQL-92). This kind of choices is common I think in all software, other RDBMS included. They are made for performance, backward compatibility and a lot of other reasons. Oracle has the famous '' is the same as NULL for example and SQL-Server has probably some, too.

There is also this blog post by Peter Bouman, where MySQL developers' choice is defended: Debunking GROUP BY myths.

In 2011, as @Mark Byers informed us in a comment (in a related question at DBA.SE), PostgreSQL 9.1 added a new feature (release date: September 2011) designed for this purpose. It is more restrictive than MySQL's implementation and closer to the standard.

Later, in 2015 MySQL announced that in 5.7 version, the behaviour is improved to conform with the standard and actually recognize functional dependencies, (even better than the Postgres implementation). The documentation: MySQL Handling of GROUP BY (5.7) and another blog post by Peter Bouman: MySQL 5.7.5: GROUP BY respects functional dependencies!

Braiam
  • 1
  • 11
  • 47
  • 78
ypercubeᵀᴹ
  • 113,259
  • 19
  • 174
  • 235
  • 2
    Just adding to your collection: in `SQL Server`, a `NULL` can violate a unique constraint. – Quassnoi Oct 18 '11 at 16:40
  • @Quassnoi: thnx, I couldn't find something when I was writing the answer. In the msdn site they mention: `as with any value participating in a UNIQUE constraint, only one null value is allowed per column`. We conclude that `NULL` is like any other value... – ypercubeᵀᴹ Oct 18 '11 at 17:00
  • 1
    yes, and it's an apparent standard violation. `4.6.5.4` says: *A unique constraint is satisfied if and only if no two rows in a table have the same non-null values in the unique columns.* – Quassnoi Oct 18 '11 at 17:54
  • For historians, the Postgres commit from three years ago, which I believe is still the only fully standard implementation: https://github.com/postgres/postgres/commit/e49ae8d3bc588294d07ce1a1272b31718cfca5ef –  Dec 11 '13 at 21:29
  • "*to be accordance with latest SQL-standards*" - I very much doubt that. MySQL's "lose" group by handling has been in the product long before the SQL:2003 standard. It follows more the MySQL guideline: "don't throw an error, prefer to return indeterminate results instead" –  Mar 04 '14 at 16:57
  • @a_horse_with_no_name I think you are right. I later edited the question by adding the next paragraph, agreeing that this was probably performance related. Another thing I'm not sure is whether that addition was in the 2003 or in the 1999 standard. – ypercubeᵀᴹ Mar 04 '14 at 17:05
  • Can you elaborate why MySQL's implementation of the "functional dependencies" detection is even better then the one in Postgres? –  Sep 26 '16 at 17:52
  • @a_horse_with_no_name long time since I wrote this. I think I meant that Postgres only considers primary key constraints when checking FDs. If I remember well, MySQL checks UNIQUE constraints as well. Not sure if it also checks "cascading" dependencies through foreign keys, have to check. So, I suppose I should worded somehow differently than "better". Perhaps "have implemented some more checks of FDs"? What do you think? – ypercubeᵀᴹ Sep 26 '16 at 18:17
9

Is MySQL breaking the standard by allowing this? How?

It lets you write a query like that:

SELECT  a.*, COUNT(*)
FROM    a
JOIN    b
ON      b.a = a.id
GROUP BY
        a.id

Other systems would require you to add all columns from a into the GROUP BY list which makes the query larger, less maintanable and less efficient.

In this form (with grouping by the PK), this does not contradict the standard since every column in a is functionally dependent on its primary key.

However, MySQL does not really check the functional dependency and lets you select columns not functionally dependent on the grouping set. This can yield indeterminate results and should not be relied upon. The only thing guaranteed is that the column values belong to some of the records sharing the grouping expression (not even to one record!).

This behavior can be disabled by setting sql_mode to ONLY_FULL_GROUP_BY.

Braiam
  • 1
  • 11
  • 47
  • 78
Quassnoi
  • 413,100
  • 91
  • 616
  • 614
  • If `a_id` is the primary key, Postgres 9.1+ allows this syntax (according to the SQL standard), which makes sense. MySQL allows it anyways, which is dubious behavior. – Erwin Brandstetter Apr 04 '13 at 11:21
4

Short answer
It's a speed hack

That is enabled by default, but that can be disabled with this setting: https://dev.mysql.com/doc/refman/8.0/en/group-by-handling.html

Long answer The reason for the non-standard shorthand group by clause is that it's a speed hack.
MySQL lets the programmer determine whether the selected fields are functionally dependent on the group by clause.
The DB does not do any testing, but just selects the first result that it finds as the value of the field.
This results in considerable speed ups.

Consider this code:

SELECT f1, f2, f3, f4 FROM t1 GROUP BY f2   
-- invalid in most SQL flavors, valid in MySQL  

MySQL will just select the first value it finds, spending a minimum amount of time.
f1,f3, f4 will be from the same row, but this relation will fall apart if multiple tables with joins are involved.

In order to do the same something simular in SQL-server you'd have to do

SELECT MIN(f1), f2, MIN(f3), MIN(f4) FROM t1 GROUP BY f2  
-- valid SQL, but really a hack

The DB will now have to examine all results to find the minimum value, huffing and puffing.
f1, f3, f4 will most likely have no relation to each other and will not be from the same row.

If however you do:

SELECT id as `primary_key`, count(*) as rowcount, count(f2) as f2count, f2, f3, f4 
FROM t1 
GROUP BY id

All the rest of the fields will be functionally dependent on id.
Rowcount will always be 1, and f2count will be either 0 (if f2 is null) or 1.

On joins, where lots of tables are involved, in a 1-n configuration like so:

Example:

Website 1 -> n Topics 1 -> n Threads 1 -> n Posts 1 -> 1 Person.

And you do a complicated select involving all tables and just do a GROUP BY posts.id
Obviously all other fields are functionally dependent on posts.id (and ONLY on posts.id).
So it makes no sense to list more fields in the group by clause, or to force you to use aggregate functions.
In order to speed things up. MySQL does not force you to do this.

But you do need to understand the concept of functional dependency and the relations in the tables and the join you've written, so it puts a pot of burden on the programmer.
However using:

SELECT 
  posts.id, MIN(posts.f2)
  ,MIN(threads.id), min(threads.other)
  ,MIN(topics.id), ....
  ,MIN(website.id), .....
  ,MIN(Person.id), ...
FROM posts p
INNER JOIN threads t on (p.thread_id = t.id)
INNER JOIN topic to on (t.topic_id = to.id)
INNER JOIN website w ON (w.id = to.website_id)
INNER JOIN person pe ON (pe.id = p.person_id)
GROUP BY posts.id   //NEVER MIND THE SYNTAX ERROR WITH THE ALIASES

Puts exactly the same mental burden on the programmer.

Paul White
  • 212
  • 4
  • 16
Johan
  • 74,508
  • 24
  • 191
  • 319
-1

All the big DBMSs have their own flavours and extensions; otherwise why would there ever be more than one of them?

Following the SQL Standards stringently is nice and all, but providing extensions with more functionality is even better. The quote from the documentation explains how this functionality is useful.

There isn't much of a conflict in this case, so I don't really see the issue.

Lightness Races in Orbit
  • 378,754
  • 76
  • 643
  • 1,055
  • MySQL's "relaxation" of the GROUP BY restriction creates more trouble than additional functionality - I personally don't think this "feature" is useful. –  Sep 29 '11 at 10:46
  • 2
    As Catcall commented in this answer (of mine): http://stackoverflow.com/questions/2311034/is-sql-group-by-a-design-flaw/5608349#5608349 : ***"The most recent SQL standards say, in effect, that you can include columns in the SELECT list that aren't in the GROUP BY list if and only if those columns are functionally dependent on columns that are in the GROUP BY list. The objections to MySQL usually have to do with its allowing you to include columns in the SELECT list that aren't functionally dependent on any columns in the GROUP BY list. "*** – ypercubeᵀᴹ Sep 29 '11 at 10:47
  • @a_horse_with_no_name: Subjective. _Why_ they opted to do this is not answerable; in hindsight, this is not a constructive question at all. – Lightness Races in Orbit Sep 29 '11 at 10:47
  • @TomalakGeret'kal: I know it's subjective. That's why I wrote "*I personally...*". If other people like to have unpredictable results then it's their right to find this useful. –  Sep 29 '11 at 10:57
  • @a_horse_with_no_name: Oh I wasn't disagreeing with your comment. Remember, though, that the results are not unpredictable when all the values are the same, which the passage in the documentation says quite clearly. – Lightness Races in Orbit Sep 29 '11 at 10:58
  • anyone who like to have unpredictable results anyone ? – rahularyansharma Sep 29 '11 at 10:59
  • @rahularyansharma: Right so basically this is just a MySQL bashing fest? Do you actually have a constructive question to ask? – Lightness Races in Orbit Sep 29 '11 at 11:00
  • @rahularyansharma: I'm pretty sure any decent programmer has needed unpredictable results many times in his programming career. Have you never used a `random()` function? – ypercubeᵀᴹ Sep 29 '11 at 12:22
  • i am thinking from starting of this question that somebody surely make random() function example for unpredictable result. but every user know it will return a random number sir. it is made for this. Is group by in MYSQl made for unpredictable results ? sorry my intention is never to hurt any developer either mysql or mssql . this is just a question why so many intelligent people make such choice to do this in MYSQL which is not logically looking correct.if anybody hurted sorry to all. – rahularyansharma Sep 29 '11 at 12:34
  • @rahularyansharma: I've explained how it's completely logical, as has the manual. – Lightness Races in Orbit Sep 29 '11 at 13:21
  • 1
    @rahularyansharma: Yes, your point is valid and I thought you would say something like that. My point (and Tomalak's) is that a MySQL developer should know that (a `GROUP BY` can return "random" results if one isn't careful), having read the docs. – ypercubeᵀᴹ Sep 29 '11 at 18:06