3

I'm trying to understand how mysql queries work with and without GROUP BY.

Imagine I have a table of FILM_ACTORS, where each ACTOR_ID has a corresponding FILM_ID. So the same actor participates in N different movies.

I want to select the actors that participate 20 movies:

SELECT actor_id FROM film_actor GROUP BY actor_id HAVING COUNT(film_id) = 20;

This query works and returns the actor_ids that participate in 20 movies. But what if I just did:

SELECT actor_id FROM film_actor HAVING COUNT(film_id) = 20;

Why does this query only returns values if I equal it to the SIZE of film_actor table (5463):

SELECT actor_id FROM film_actor HAVING COUNT(film_id) = 5463;

In this case it returns me actor_id = 1. Why? Is it selecting film_ids without considering the corresponding actor_ids?

anuragw
  • 133
  • 7
Pedro Barros
  • 183
  • 1
  • 11
  • See if this question (and its selected answer) helps: http://stackoverflow.com/questions/6924896/having-without-group-by – Maximus2012 May 26 '15 at 16:34
  • 1
    [MySQL extends the use of GROUP BY so that the select list can refer to nonaggregated columns not named in the GROUP BY clause.](https://dev.mysql.com/doc/refman/5.0/en/group-by-handling.html) You can use this feature to get better performance by avoiding unnecessary column sorting and grouping... **HOWEVER** The server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate. So in your case, the system is randomly selecting an actor ID from a table of 5463 actors. not first nor last to have problems with mysql group by extensions. – xQbert May 26 '15 at 16:35
  • 3
    Think about the name. `GROUP BY` will help you get results for each *group* of a field. Without it, there is only one group, which is the entire table. The values shown in that one row are arbitrary. – AdamMc331 May 26 '15 at 16:47

3 Answers3

5

GROUP BY groups the results by the value of the following columns, normally used with aggregate functions (eg, COUNT).

So your first query returns one row for each value of actor_id, and the HAVING limits the results to those where the count is 20

Without a GROUP BY clause the aggregate function acts on all rows. So your 2nd query is selecting the actor_id where the count of films is 20, but without grouping the count is 5463 (ie, number of rows on the table). The actor_id returned in this situation is not determined (ie, could be any of them).

Kickstart
  • 21,403
  • 2
  • 21
  • 33
4

There's no GROUP BY in the second query. The usage of the aggregate function COUNT in the HAVING clause means the query is going to return at most one row.

Compare to this query:

SELECT actor_id, COUNT(film_id) FROM film_actor

Which returns a single row, e.g.

actor_id  COUNT(film_id)
--------  --------------
      42            5463

(NOTE: By default, MySQL will return a result for that query. Other databases would reject that query, and raise an "non aggregate not in group by" type error. The issue is that the reference to actor_id in the SELECT list. To get that query to work in other databases, we would need to remove actor_id from the SELECT list. We can get MySQL to behave the same way, if we set sql_mode to include ONLY_FULL_GROUP_BY.)

Note that the value returned for actor_id is a value from "some row". It's not deterministic which row that value is returned from, it can be from any row. The value returned for COUNT is from the entire table.


If you want the COUNT for each actor, you need a GROUP BY clause, as in your first query.

    SELECT actor_id, COUNT(film_id) FROM film_actor GROUP BY actor_id

Starting with that query as a basis, you can add a HAVING clause. And you can also remove the COUNT(film_id) from the SELECT list. But you can't remove the GROUP BY, without affecting what gets returned for COUNT(film_id).

spencer7593
  • 106,611
  • 15
  • 112
  • 140
  • 1
    So, using only COUNT() will get me the value of all counts of film_ids from the table without considering the actor_ids and only returns one row. Correct? – Pedro Barros May 26 '15 at 16:45
  • 2
    @PedroBarros I'm going to jump in and say yes, that is correct. Thinking about it very simply, `GROUP BY` helps you get results *per group*. If you use the clause, you get a number for *each group of actors*. Without it, there is only one group, the entire table. – AdamMc331 May 26 '15 at 16:46
  • @Pedros: Echoing McAdams331, yes, that is correct. Without a `GROUP BY` clause, an *aggregate* function (e.g. `COUNT`, `SUM`, `MAX` et al.) will operate on the set as *one* single group. – spencer7593 May 26 '15 at 17:48
3

So lets say you had:

+---------------------------------+
| actor_id | actor_name | film_id |
+---------------------------------+
|        4 |       John |       3 |
|        4 |       John |       4 |
|        5 |       Alex |       3 |
+---------------------------------+

When you run:

SELECT actor_id, COUNT(film_id) AS Films FROM film_actor GROUP BY actor_id;

We would get:

+------------------+
| actor_id | Films |
+------------------+
|        4 |     2 |
|        5 |     1 |
+------------------+

So, we can do:

SELECT actor_id, COUNT(film_id) AS Films FROM film_actor GROUP BY actor_id WHERE Films = 2;

This should just return actor_id of 4.

Twisty
  • 30,304
  • 2
  • 26
  • 45
  • Upvoting for showing a concrete example of what `GROUP BY` does. It wouldn't hurt to also add a counter example to show what the results of your table would be if you *didn't* use group by. – AdamMc331 May 26 '15 at 16:48