65

My lecturer stated:

All column names in SELECT list must appear in GROUP BY clause unless name is used only in an aggregate function.

I'm just wanting some confirmation of this as I cannot think of a logical explanation as to why it should be true.

Mehdi Charife
  • 722
  • 1
  • 7
  • 22
user559142
  • 12,279
  • 49
  • 116
  • 179
  • 9
    he forgot to qualify ... unless you are running mysql which is ... weird – Sam Saffron May 13 '11 at 00:48
  • You may want to look at http://stackoverflow.com/questions/1135997/why-no-select-foo-group-by-foo-id-in-postgres – James Black May 13 '11 at 00:49
  • @Sam Saffron: Sybase also is (or was when I used it several years ago) another exception, and it was a much, much worse exception. It forced an implicit outer join! I can't count how many disaster queries I accidentally started because of that... – btilly May 13 '11 at 02:10
  • 1
    Did your lecturer say anything about the converse? Namely, "Does every column in a `GROUP BY` clause have to appear in the `SELECT` list?" – Jus12 Jan 14 '16 at 04:50
  • @Jus12 Did you manage to find an answer for that? – Mehdi Charife Jun 20 '23 at 15:53

6 Answers6

50

Imagine the following:

    A    B    C
   Cat   10   False
   Dog   25   True
   Dog   20   False
   Cat   5    False

If you select A, B and Group By Only A - what would your output be? You'd only have two rows (or tuples) because you have two values for A - but how does it display B?

If you group by A, B, you'd get four rows, no problems there. If you group by A and perform a function on B - like SUM(B) then you get two rows again:

    Cat    15
    Dog    45

But if you select A, B and only group by A - it doesn't know what to do. Truthfully, I believe there are some databases out there that will select a random value for B in that case and I believe there are some that will give you an error message.

Rob P.
  • 14,921
  • 14
  • 73
  • 109
  • 1
    In MySQL, the non grouped by columns will match the key column, not be shuffled randomly. If you grouped by A, you might get Cat 10 or Cat 5, but never Cat 25. – Buttle Butkus Apr 05 '14 at 01:55
  • 1
    Sorry, not never. As long as no aggregates are used on other columns, the group by spits out a complete, random row from among the group. But the results of an aggregate on one column do not cause the other columns to match. So the aggregate will cause disjointed results. – Buttle Butkus Apr 05 '14 at 05:43
  • 1
    well, (at least for sqlite3) the unaggregated columns are random whether you aggregate some other columns or not – Alex Brown Feb 12 '15 at 07:34
22

That's historically true. Omitting unaggregated columns leads to indeterminate behavior. SQL aims at fully determinate behavior.

But SQL standards have recently changed to let you omit from the GROUP BY clause columns that are functionally dependent on columns that are in the GROUP BY. PostgreSQL follows the more recent SQL standards. (It's not the only one.) Behavior is still fully determinate.

create table a (
  a_id integer primary key,
  xfr_date date not null
);

create table b (
  a_id integer not null references a (a_id),
  recd_date date not null,
  units_recd integer not null 
    check (units_recd >= 0),
  primary key (a_id, recd_date)
);

select a.a_id, a.xfr_date, sum(b.units_recd)
from a
inner join b on a.a_id = b.a_id
group by a.a_id; -- The column a.xfr_date is functionally dependent 
                 -- on a.a_id; it doesn't have to appear in the 
                 -- GROUP BY clause.

The notable deviation from SQL standards is MySQL. It lets you omit just about everything from the GROUP BY. But that design choice makes its behavior indeterminate when you omit columns that are in the SELECT list.

Hooman Bahreini
  • 14,480
  • 11
  • 70
  • 137
Mike Sherrill 'Cat Recall'
  • 91,602
  • 17
  • 122
  • 185
  • 1
    "functionally dependent" is interesting: please see my comment here http://stackoverflow.com/questions/5974636/count-query-in-sql/5974669#comment-6895150 – gbn May 13 '11 at 05:06
  • 1
    Yes, *functionally dependent* means behavior under the new standard is still determinate. It behaves as if you'd specified all the columns; it just lets you leave some columns out of the GROUP BY if the dbms can derive them from the FDs in its system tables. – Mike Sherrill 'Cat Recall' May 13 '11 at 09:08
  • Thanks. Call me old fashioned and/or cynical but this will simply confuse a lot of folk... – gbn May 13 '11 at 09:10
  • @gbn: Well, you don't have to use it. The old rules still work, and the new rule is just shorthand for the old rule. If you work with a lot of other database developers, it might take a little while for you to get used to seeing it in their code. But *you* won't have as much trouble as the people who don't understand what a functional dependency is. And judging from questions on SO and on Usenet, there are a *lot* of those. – Mike Sherrill 'Cat Recall' May 13 '11 at 09:27
  • SQL Server allows some functionally-dependent columns to be omitted e.g. `CREATE TABLE T (col1 INTEGER, col2 AS col1);` `SELECT col1, col2 FROM T GROUP BY col1;` – onedaywhen May 26 '11 at 14:43
  • Also SQLite is a [notable exception](https://www.sqlite.org/lang_select.html): " If the expression is an aggregate expression, it is evaluated across all rows in the group. Otherwise, it is evaluated against a single arbitrarily chosen row from within the group". Basically, it just returns one row in the group without a defined criteria – Jacopofar Aug 29 '18 at 12:42
  • Explain plan (MySQL Ver 14.14 Distrib 5.7.25) shows that *possible_keys* gets null when using `select id, name from table group by id, name` instead of `select id, name from table group by id`. Performances are better with second query. – nicolas Jan 25 '19 at 11:55
10

Actually, in MySQL you don't have to group by all columns. You can just group by whatever columns you want. The problem is, it will just pull a random value (from the set of available rows in the group) for the fields which aren't in the group by. If you know that you are grouping by something that is a unique key, there's no point in grouping by the rest of the fields, as they will already all have the same value anyway. It can actually speed it up to not have to group by every field when it is completely unnecessary.

Kibbee
  • 65,369
  • 27
  • 142
  • 182
  • 1
    'it will just pull a random value for the fields which aren't in the group by' - This is the reason. DB won't know which of the values it should select for column you are not grouping by, should it pick random, min, max, avg what?, hence most of them force you to select one value. – YetAnotherUser May 13 '11 at 00:52
  • Yes, but if the developer is smart, they will group by something that is the key, and save the DB some work. It also leads to a lot of confusion where people write aggregate statements without a group by, and wonder why there's only 1 row, and the results don't make any sense. – Kibbee May 13 '11 at 00:57
2

So the simple answer is: It depends. Mysql allows it, vertica doesn't.

There is actually a valid use case for omitting and that is when you are already selecting say with MIN().

Here is an actual example for event tracking. Imaging you have credit and purchase events.

For simplicity we say a=credit, b,c,d are some kind of purchase event, and time is tracked with a running number. Now you want to find the date of the first purchase after each credit. We also happen to have only one customer 0:

create table events (user_id int ,created_at int, event varchar(255));
insert into events values (0,0, 'a');
insert into events values (0,1, 'b');
insert into events values (0,2, 'c');
insert into events values (0,3, 'a');
insert into events values (0,4, 'c');
insert into events values (0,5, 'b');
insert into events values (0,6, 'a');
insert into events values (0,7, 'a');
insert into events values (0,8, 'd');

mysql> SELECT user_id, MAX(purchased) AS purchased, spent, event FROM (SELECT e1.User_ID AS user_id, e1.created_at AS purchased, MIN(e2.created_at) AS spent, e2.event AS event FROM events e1, events e2 WHERE e1.user_id = e2.user_id AND e1.created_at <= e2.created_at AND e1.Event = 'a' AND e2.Event != 'a' GROUP BY e1.user_id, e1.created_at) e3 GROUP BY user_id, spent;
+---------+-----------+-------+-------+
| user_id | purchased | spent | event |
+---------+-----------+-------+-------+
|       0 |         0 |     1 | b     |
|       0 |         3 |     4 | c     |
|       0 |         7 |     8 | d     |
+---------+-----------+-------+-------+
3 rows in set (0.00 sec)

looks good in mysql, does not work in vertica:

ERROR 2640: Column "e2.event" must appear in the GROUP BY clause or be used in an aggregate function

if I omit the event column, it works in both, but I do actually want to know what specific value event had for the row that min selected.

So my answer ends with a request for comment :) Any ideas?

Michael Will
  • 105
  • 1
  • 6
2

If you are grouping on something you cannot see the individual values of non-grouped columns because there may be more than one value within each group. All you can do is report on aggregate functions (sum, count, min & etc) -- these are able to combine the multiple values into a single cell in the result.

hsmiths
  • 1,257
  • 13
  • 17
  • you can get values for the others, but the spec says it's a random column. Anecdotally, for sqlite3, it's not random, it's the last. – Alex Brown Feb 12 '15 at 07:36
1

There are exceptions as noted by Sam Saffron but generally what your lecturer said is true.

If I select 3 columns and group by 2 what should the RDBMS do with the 3rd column?

The developers of the RDBMS may make a decision of how to handle the extra colum (as it appears MySQL's developers have) but is it the decision I would have made or the one I want when writing the select? Will the decision always be valid? I certainly prefer the Oracle-like approach of forcing me to explicitly state what should happen.

If I select 3 columns and group by 2 should the RDBS group by all 3, pick a random value from the 3rd, the biggest or littlest, the most common?

Karl
  • 3,312
  • 21
  • 27
  • It seems RBDMS apply Group By **after** Select for that it gives indeterminate behavior if it's the case, why don't apply Group By **before** Select? – La VloZ Merrill Dec 27 '17 at 10:03