Are arbitrarily SELECTed non-aggregate columns guaranteed to come from the same row with a MySQL GROUP BY?

Question

More of a question of interest than something I'm looking to use

Say I run the following query:

  SELECT su.id, su.name, sua.line_1, sua.line_2
    FROM site_user su
    JOIN site_user_address sua
      ON sua.user_id = su.id
   WHERE su.id = 1 
GROUP BY su.id /* id is the PK for site_user */

Are sua.line_1 and sua.line_2 guaranteed to come back from the same site_user_address row?

I'm aware that, at the very least, the site_user_address row is arbitrarily selected

I couldn't find anything in the docs, and can't think of a reliable way to test this

UPDATE

This is not a duplicate as far as I'm aware.. I'm wondering if the arbitrary selection of a value will come from the same arbitrarily selected row when choosing two different columns from the same table

Other questions seem to be less specific, and more directed towards the arbitrary selection of both row and value of one or more non-aggregate columns

Possible duplicate of [Error related to only\_full\_group\_by when executing a query in MySql](https://stackoverflow.com/questions/34115174/error-related-to-only-full-group-by-when-executing-a-query-in-mysql) — Madhur Bhaiya, Sep 28 '18 at 16:49
Nope, not guaranteed. They usually do, but it is not guaranteed, and if you did `sua.line_1, sua.line_2, min(sua.something)` the line_ values should definitely should not be assumed to come from the row the min value did. — Uueerdo, Sep 28 '18 at 16:50
That's why no other dbms will not even accept your query, because it doesn't make sense. You need to have all the non-aggregated columns to be included in the `GROUP BY`. — Eric, Sep 28 '18 at 17:05
@Eric `GROUP BY` without aggregate functions is more-or-less a cheezy way of doing `SELECT DISTINCT`. — O. Jones, Sep 28 '18 at 17:08
@Eric I have found that grouping by PK alone is an easy short hand way to get a full table row when joining to multiple entries in another table. For example, the `su.name` in the query above, which is always consistent with `su.id`. It may be advised by purists, but you don't **need** to have all the non-aggregated columns included for MySQL, and it makes sense under certain conditions. — Arth, Sep 29 '18 at 11:47

O. Jones · Accepted Answer · 2018-09-29T15:08:52.230

You are indeed exploiting the notorious non-standard MySQL implementation of GROUP BY. Read this for more info https://dev.mysql.com/doc/refman/8.0/en/group-by-handling.html

Your query is equivalent to using ANY_VALUE() for the values from the second table, like this.

  SELECT su.id, su.name, 
         ANY_VALUE(sua.line_1) line_1, 
         ANY_VALUE(sua.line_2) line_2
    FROM site_user su
    JOIN site_user_address sua
      ON sua.user_id = su.id
   WHERE su.id = 1 
   GROUP BY su.id

As long as site_user.id is unique for each row, you'll get the right name because it's dependent on the id value.

But for the values from the other table, your results will be, formally speaking, non-deterministic or unpredictable. Do they come back from the same row as each other? Unpredictable.

Unpredictable is like random, but worse. Random implies you'll get different values sometimes, so you can catch problems in test. Unpredictable means you'll get the same values every time, until you don't. Usually changes in the chosen values occur when tables and indexes grow larger.

You may want to use a query that gives predictable results. For example, this one will give back the rows of site_user_address that have the highest value of site_user_address.id.

  SELECT su.id, su.name, 
         sua.line_1, sua.line_2
    FROM site_user su
    JOIN (
              SELECT MAX(id) id, user_id
                FROM site_user_address
               GROUP BY user_id
         ) sumax               ON su.id = sumax.user_id
    JOIN site_user_address sua ON sumax.id = sua.id
   WHERE su.id = 1

The subquery:

              SELECT MAX(id) id, user_id
                FROM site_user_address
               GROUP BY user_id

retrieves the largest site_user_address.id value for each user_id value. You can then join to the site_user_address table to pull the detail values for those id values.

Beware the temptation to infer that behaviors are designed into a DBMS, but not documented, just because you observe them. The behavior in your question is documented. The documentation says it is non-deterministic. That means that the DBMS query planner is free to satisfy your query in whatever way seems most efficient. As tables and indexes grow, the query planner can, and does, choose different ways to satisfy the same query. Different query plans can also come from different parts of tables being resident in caches. Etc. Etc.

This unpredictability is hard for programmers to wrap our brains around. We wouldn't tolerate unpredictability in procedural Java or php code. But SQL is declarative, and many thousands of programmer years have gone into making queries run fast. Don't try to outsmart a DBMS, even if you're Michael Stonebraker himself.

Thanks! I suppose even if those two columns always do come from the same row, it's not documented as such, so it could be implemented differently in later versions. I guess there's a use case for get me a user, and any of the addresses they may have (with the address parts consistent), but there's a myriad of different ways of doing that without relying on undocumented behaviour! — Arth, Sep 29 '18 at 11:56
Thanks for the update, I totally agree with you.. I used a poor choice of words, '.. relying on undocumented behaviour' should read 'relying on behaviour not in line with the documentation'. — Arth, Oct 02 '18 at 15:27

Are arbitrarily SELECTed non-aggregate columns guaranteed to come from the same row with a MySQL GROUP BY?

1 Answers1