Why do we need GROUP BY with AGGREGATE FUNCTIONS?

Question

I saw an example where there was a list (table) of employees with their respective monthly salaries. I did a sum of the salaries and saw the exact same table in the ouptput. That was strange.

Here is what has to be done - we have to find out how much money we pay this month as employee salaries. For that, we need to sum their salary amounts in the database as shown:

SELECT EmployeeID, SUM (MonthlySalary) 
FROM Employee
GROUP BY EmpID

I know that I get an error if I don't use GROUP BY in the above code. This is what I don't understand.

We are selecting EmployeeID from the Employee table. SUM() is being told that it has to add the MonthlySalary column, from the Employee table. So, it should directly go and add those numbers up instead of grouping them and then adding them.

Thats how a person would do it - look at the employee table and add all the numbers. Why would they take the trouble to group them and then add them up?

If you want to sum the entire column without grouping, then do not use SQL group by. SQL group by is not doing what a person would be doing. In SQL, grouping makes no sense if you are not doing something to the groups. — Eric Leschinski, Dec 21 '12 at 23:37
The groups of EmpID from the Employee table. Instead of trying to compare what group by is doing relative to what a person does, ask instead: How does SQL group by work and how can we align it with what we want: http://stackoverflow.com/questions/7434657/how-does-group-by-works — Eric Leschinski, Dec 21 '12 at 23:43
because a person would implicitly infer that he/she has to sum the salary number across the same employee id. A human would scan the table and sum all the salaries corresponding to the same employee id. The "corresponding to the same employee id" specification is the "GROUP By EmpID" part that a human would infer and a computer would not :-) The fact that you are selecting "EmployeeID" may have nothing to do with the fact that you want to sum all the salaries across the same employee id — Gianluca Ghettini, Aug 10 '18 at 17:56

score 62 · Accepted Answer · edited Jan 15 '14 at 11:49

It might be easier if you think of GROUP BY as "for each" for the sake of explanation. The query below:

SELECT empid, SUM (MonthlySalary) 
FROM Employee
GROUP BY EmpID

is saying:

"Give me the sum of MonthlySalary's for each empid"

So if your table looked like this:

+-----+------------+
|empid|MontlySalary|
+-----+------------+
|1    |200         |
+-----+------------+
|2    |300         |
+-----+------------+

result:

+-+---+
|1|200|
+-+---+
|2|300|
+-+---+

Sum wouldn't appear to do anything because the sum of one number is that number. On the other hand if it looked like this:

+-----+------------+
|empid|MontlySalary|
+-----+------------+
|1    |200         |
+-----+------------+
|1    |300         |
+-----+------------+
|2    |300         |
+-----+------------+

result:

+-+---+
|1|500|
+-+---+
|2|300|
+-+---+

Then it would because there are two empid 1's to sum together. Not sure if this explanation helps or not, but I hope it makes things a little clearer.

score 9 · Answer 2 · edited Jan 15 '14 at 11:51

If you wanted to add up all the numbers you would not have a GROUP BY:


SELECT SUM(MonthlySalary) AS TotalSalary
FROM Employee

+-----------+
|TotalSalary|
+-----------+
|777400     |
+-----------+

The point of the GROUP BY is that you get a separate total for each employee.

+--------+------+
|Employee|Salary|
+--------+------+
|John    |123400|
+--------+------+
|Frank   |413000|
+--------+------+
|Bill    |241000|
+--------+------+

score 8 · Answer 3 · answered Dec 22 '12 at 00:19

8

The sad thing is that there is one database that supports the syntax you are suggesting:

SELECT EmployeeID, SUM (MonthlySalary) 
FROM Employee

However, MySQL does not do what you expect. It returns the overall sum of the MonthlySalary for everyone, and one arbitrary EmployeeId. Alas.

Your question is about SQL syntax. The answer is that is how SQL has been defined, and it is not going to change. Determining the aggregation fields from the SELECT clause is not unreasonable, but it is not how this language is defined.

I do, however, have some sympathy for the question. Many people learning SQL think of "grouping" as something done in the context of sorting the rows. Something like "sort the cities in the US and group them by state in the output". Makes sense. But "group by" in SQL really means "summarize by" not "keep together".

answered Dec 22 '12 at 00:19

Gordon Linoff

1,242,037
58
646
786

`Many people learning SQL think of "grouping" as something done in the context of sorting the rows. Something like "sort the cities in the US and group them by state in the output"` - How can you achieve grouping in SQL then ? – user Jul 10 '14 at 17:28
Sadly **SQLite** also supports mixing aggregate and non-aggregate columns without a *GROUP BY* clause. **PostgreSQL** objects and provides a helpful error message. – Clint Pachl Sep 05 '20 at 08:12
@ClintPachl . . . Actually, most databases provide helpful error messages. Postgres now supports the standard which allows any column in a table in the `select` when the `group by` columns include the primary keys/unique keys for the table. – Gordon Linoff Sep 05 '20 at 12:30
@GordonLinoff yes, but I don't think that's what we are talking about here. – Clint Pachl Sep 06 '20 at 21:22
@GordonLinoff Well, SQLite is apparently not "most databases". Its error messages are imprecise and many times unhelpful. For example, I just debugged a WITH query containing 5 sections and a final compilation query. It's about 244 lines long including SQL and comments. I forgot a comma between two sections on line 161. Guess what SQLite's error message was? *Error: near line 49: near "SELECT": syntax error* Guess what's on line 49? The keyword **WITH**; that's it! So **not** helpful. And very frustrating! – Clint Pachl Sep 06 '20 at 21:35

score 3 · Answer 4 · answered Dec 21 '12 at 23:40

If you don't specify GROUP BY, aggregate functions operate over all the records selected. In that case, it doesn't make sense to also select a specific column like EmployeeID. Either you want per-employee totals, in which case you select the employee ID and group by employee, or you want a total across the entire table, so you leave out the employee ID and the GROUP BY clause.

In your query, if you leave out the GROUP BY, which employee ID would you like it to show?

Why do we need GROUP BY with AGGREGATE FUNCTIONS?

4 Answers4

Linked