42

I have a table emp with following structure and data:

name   dept    salary
-----  -----   -----
Jack   a       2
Jill   a       1
Tom    b       2
Fred   b       1

When I execute the following SQL:

SELECT * FROM emp GROUP BY dept

I get the following result:

name   dept    salary
-----  -----   -----
Jill   a       1
Fred   b       1

On what basis did the server decide return Jill and Fred and exclude Jack and Tom?

I am running this query in MySQL.

Note 1: I know the query doesn't make sense on its own. I am trying to debug a problem with a 'GROUP BY' scenario. I am trying to understand the default behavior for this purpose.

Note 2: I am used to writing the SELECT clause same as the GROUP BY clause (minus the aggregate fields). When I came across the behavior described above, I started wondering if I can rely on this for scenarios such as: select the rows from emp table where the salary is the lowest/highest in the dept. E.g.: The SQL statements like this works on MySQL:

SELECT A.*, MIN(A.salary) AS min_salary FROM emp AS A GROUP BY A.dept

I didn't find any material describing why such SQL works, more importantly if I can rely on such behavior consistently. If this is a reliable behavior then I can avoid queries like:

SELECT A.* FROM emp AS A WHERE A.salary = ( 
            SELECT MAX(B.salary) FROM emp B WHERE B.dept = A.dept)
Harish Shetty
  • 64,083
  • 21
  • 152
  • 198
  • 1
    note that rows order are re-shuffled on MVCC-based database, the first row can become the last row if you update that first row – Michael Buen Oct 20 '09 at 00:58

9 Answers9

39

Read MySQL documentation on this particular point.

In a nutshell, MySQL allows omitting some columns from the GROUP BY, for performance purposes, however this works only if the omitted columns all have the same value (within a grouping), otherwise, the value returned by the query are indeed indeterminate, as properly guessed by others in this post. To be sure adding an ORDER BY clause would not re-introduce any form of deterministic behavior.

Although not at the core of the issue, this example shows how using * rather than an explicit enumeration of desired columns is often a bad idea.

Excerpt from MySQL 5.0 documentation:

When using this feature, all rows in each group should have the same values
for the columns that are omitted from the GROUP BY part. The server is free
to return any value from the group, so the results are indeterminate unless
all values are the same. 
mjv
  • 73,152
  • 14
  • 113
  • 156
  • @mjv, I'm considering user the statement `select user_id, order_id from (select user_id, order_id from orders order by user_id, order_id desc) a group by user_id` to select the latest order of each user_id, won't that reach my purpose? – mingchau Feb 25 '19 at 09:56
9

This is a bit late, but I'll put this up for future reference.

The GROUP BY takes the first row that has a duplicate and discards any rows that match after it in the result set. So if Jack and Tom have the same department, whoever appears first in a normal SELECT will be the resulting row in the GROUP BY.

If you want to control what appears first in the list, you need to do an ORDER BY. However, SQL does not allow ORDER BY to come before GROUP BY, as it will throw an exception. The best workaround for this issue is to do the ORDER BY in a subquery and then a GROUP BY in the outer query. Here's an example:

SELECT * FROM (SELECT * FROM emp ORDER BY name) as foo GROUP BY dept

This is the best performing technique I've found. I hope this helps someone out.

Samuel Hodge
  • 139
  • 1
  • 2
  • 7
    I can find no reference to support your assertion that “GROUP BY takes the first row that has a duplicate and discards any rows that match after it in the result set.” To the contrary, MySQL in particular clearly states that the value for nonaggregated columns is taken arbitrarily from any row in the group. – danorton Mar 07 '12 at 07:28
  • 3
    This is **not** helpful. I consider it as worsening the situation. Instaed of one non-standard feature, you are now using two non-standard features. The results can be erroneous and there is no guarantee that you will get the first result per dept, ordered by name. – ypercubeᵀᴹ Jan 28 '13 at 09:21
  • There's no evidence that this is the case. Experiment indicates that this might be happening, but if it is not written into the specification, it may change from under us. – Rikki Aug 02 '13 at 14:46
  • Having the sub-query (returning one name, using LIMIT 1) as part of the WHERE clause is the approach I took for a similar problem. e.g. SELECT * FROM emp WHERE name = (SELECT name FROM emp ORDER BY name LIMIT 1); – Rikki Aug 02 '13 at 14:59
  • 2
    **Careful** This answer is **INCORRECT**. It LOOKS LIKE it should work, it would be great if it worked but it DOES NOT work... – BenL Dec 19 '14 at 12:04
  • 1
    I agree, this does not work. It works SOMETIMES but they values are indeterminate. My tests were failing at random and the problem was using the technique described above. – Kamil Latosinski Nov 20 '17 at 11:37
4

As far as I know, for your purposes the specific rows returned can be considered to be random.

Ordering only takes place after GROUP BY is done

Nico Haase
  • 11,420
  • 35
  • 43
  • 69
Joel L
  • 3,031
  • 1
  • 20
  • 33
2

You can put a:

SET GLOBAL sql_mode=(SELECT REPLACE(@@sql_mode,'ONLY_FULL_GROUP_BY',''));

before your query to enforce SQL standard GROUP BY behavior

Kate Orlova
  • 3,225
  • 5
  • 11
  • 35
cube.head
  • 95
  • 10
1

I find that the best thing to do is to consider this type of query unsupported. In most other database systems, you can't include columns that aren't either in the GROUP BY clause or in an aggregate function in the HAVING, SELECT or ORDER BY clauses.

Instead, consider that your query reads:

SELECT ANY(name), dept, ANY(salary)
FROM emp 
GROUP BY dept;

...since this is what's going on.

Hope this helps....

Rob Farley
  • 15,625
  • 5
  • 44
  • 58
0

I think ANSI SQL requires that the select includes only fields from the GROUP BY clause, plus aggregate functions. This behaviour of MySQL looks like returns some row, possibly the last one the server read, or any row it had at hand, but don't rely on that.

Petruza
  • 11,744
  • 25
  • 84
  • 136
  • About Marius comment: (I can't comment on it due to low scoring) As others said, Order By acts on the result of the Grouping, there is no point in sorting rows that will be collapsed by a grouping. Instead, you could select MAX( name ) which would actually return the last name if the rows were ordered alfabetically ascending. – Petruza Dec 03 '09 at 15:07
0

This would select the most recent row for each person:

SELECT * FROM emp
WHERE ID IN
(
    SELECT
        MAX(ID) AS ID
    FROM
        emp
    GROUP BY
        name
)
Paul Chris Jones
  • 2,646
  • 1
  • 23
  • 21
-1

If you are grouping by department does it matter about the other data? I know Sql Server will not even allow this query. If there is a possibility of this sounds like there might be other issues.

CSharpAtl
  • 7,374
  • 8
  • 39
  • 53
-2

Try using ORDER BY to pick the row that you want.

SELECT * FROM emp GROUP BY dept ORDER BY name ASC;

Will return the following:

name   dept    salary
-----  -----   -----
jack   a       2
fred   b       1
Marius
  • 57,995
  • 32
  • 132
  • 151