Group By Vs Distinct in SQL

Question

SELECT continent, COUNT(name)
  FROM world
 WHERE population>200000000
 GROUP BY continent

When i execute the query above the query runs fine. It basically shows the number of countries in each continent that has a population larger than 200000000.

However when I modify my query to the below :

  SELECT DISTINCT(continent), COUNT(name)
      FROM world
     WHERE population>200000000

This does not work. I am wondering what the reason is. In this case I am saying for each distinct continent count the total countries with population larger than 200000000.

I just want to understand the reasoning so i can become better at writing queries.

Your 2nd query wouldn't run in most databases (`mysql` will allow it). If you are selecting fields that aren't used in an aggregate function such as `min`, `max` or `count`, you need to include that field in the `group by` clause. `distinct` is not an aggregate and would apply to the entire result set... — sgeddes, Aug 30 '19 at 18:42
SQL has its own syntax, with differences between various rdbms and does not follow the syntax of the English language. In this case what you are saying is not translated properly to SQL. Also, does your 2nd query even run? — forpas, Aug 30 '19 at 18:43
"*I am saying for each distinct continent*" - No, that's not what that code means. — melpomene, Aug 30 '19 at 18:45
`distinct` is **NOT** a function. It always applies to all columns in the select list. Putting parentheses around a column name won't change anything. `distinct (a),b` is the same as `distinct a,(b)` or `distinct a,b` — , Aug 30 '19 at 18:53
Distinct isn't a function it's simply a tag that says only return unique values for all columns in a row. So if you have North America, United States listed twice in the data set, it would return only once. Distinct should be used to remove duplicate data that you don't want or need. Group by should be used when using aggregates like sum, count, min, max,avg etc... — xQbert, Aug 30 '19 at 18:54
Consider: https://stackoverflow.com/questions/3408037/distinct-function-not-select-qualifier-in-postgres — xQbert, Aug 30 '19 at 18:59
Notice how that count in the second query isn't exactly a meaningful value (supposing it doesn't error out.) — shawnt00, Aug 30 '19 at 19:14
What is "This does not work" supposed to mean? Are you getting wrong results? Or an error message? If so, which? And what DBMS are you working with? — Thorsten Kettner, Aug 30 '19 at 19:36
On a side note: `world` is a not a good name for this table. The table name should tell what a row in the table represents, which is probably a country in your case. So, `country` or `countries` would be the appropriate name here. — Thorsten Kettner, Aug 30 '19 at 20:05
Another side note: Do you really want to count non-null occurrences of `name`? Can `name` be null? If not, then you only want to count rows. Use the simpler `count(*) then. — Thorsten Kettner, Aug 30 '19 at 20:14

Himanshu · Answer 1 · 2019-08-31T03:11:35.780

GROUP BY AND DISTINCT are very much seperate in one way or the other.

Group by is used specifically to create and perform aggregation per groups while distinct is just used to have distinct/unique records or removing duplicates nothing else.

   SELECT continent, COUNT(name)
     FROM world
   WHERE population>200000000
   GROUP BY continent

The first query has a group by on continent it will group all rows which are having same continent into seperate groups after filtering via where.

This query will give you records of count per each continent

    SELECT DISTINCT continent, 
   COUNT(name)
   FROM world
  WHERE population>200000000

The 2nd query means performing distinct and count on whole table but not groups (note) after filtering population. This query will give you distinct/unique continent but count is independent of groups and is of whole table

score 0 · Answer 2 · answered Aug 30 '19 at 19:03

I think you want:

SELECT continent
   , COUNT(DISTINCT name) AS DistinctCountries
FROM world
WHERE population>200000000
GROUP BY continent

If want each row to represent a continent, you need to group by continent. Then count the distinct countries in the continent where your condition is met.

Thorsten Kettner · Answer 3 · 2019-08-30T20:08:54.867

The first query and its order of evaluation:

FROM world: Get rows from the world table.
WHERE population>200000000: Only accept rows (countries?) with a population greater than 200000000.
GROUP BY continent: Aggregate the rows so as to get one result row per continent.
SELECT COUNT(name): For the continent show the count of its rows found in 3 where name is not null.
SELECT continent: show the continent.

The second query and its order of evaluation:

FROM world: Get rows from the world table.
WHERE population>200000000: Only accept rows (countries?) with a population greater than 200000000.
GROUP BY continent: Aggregate the rows so as to get one result row per continent.
SELECT COUNT(name): As there is no group by clause, this is saying you want one result row only, with the count of all rows found in 3 where name is not null.
SELECT (continent): The parentheses are superfluous. You are saying you want to show the continent. However, as you said with COUNT(name), you wanted to show one result row only, which continent are you talking about? It makes no sense to the DBMS and is invalid SQL. (There is one DBMS making an exception here, though: MySQL would just pick a continent arbitrarily rather than raising an error, a certain setting provided.)
SELECT DISTINCT: Of all result rows, you want duplicates removed, i.e. all rows showing the same continent and count.

Your error, as you can see, is in steps 4 and 5, where SELECT COUNT(name) without GROUP BY and SELECT (continent) don't match semantically.

score 0 · Accepted Answer · answered Aug 30 '19 at 22:48

Why does this not work?

SELECT DISTINCT(continent), COUNT(name)
FROM world
WHERE population > 200000000;

That is simple. You have an aggregation query, because you have COUNT() in the SELECT. You have no GROUP BY, so any other columns references in the SELECT must be the arguments of aggregations columns. So, continent generates an error.

You seem to also be under the impression that the parentheses around continent have some significance. They do not. Not at all. SQL has a construct, SELECT DISTINCT, which selects distinct values of rows.

Also note that DISTINCT is almost never used with aggregation functions.

Group By Vs Distinct in SQL

4 Answers4

The first query and its order of evaluation:

The second query and its order of evaluation: