39

Why would someone use a group by versus distinct when there are no aggregations done in the query?

Also, does someone know the group by versus distinct performance considerations in MySQL and SQL Server. I'm guessing that SQL Server has a better optimizer and they might be close to equivalent there, but in MySQL, I expect a significant performance advantage to distinct.

I'm interested in dba answers.

EDIT:

Bill's post is interesting, but not applicable. Let me be more specific...

select a, b, c 
from table x
group by a, b,c

versus

select distinct a,b,c
from table x
Frank V
  • 25,141
  • 34
  • 106
  • 144
mson
  • 7,762
  • 6
  • 40
  • 70
  • 1
    This is closely related to, but not exactly the same question as, http://stackoverflow.com/questions/164319/is-there-any-difference-between-group-by-and-distinct – Bill Karwin Jan 09 '09 at 02:38
  • Most probably because they didn't know what they were about. If you're sure they knew what they were about, then I'd suspect that there is a difference in the way NULL values are treated between the two - offhand, I can't think what, though. – Jonathan Leffler Jan 09 '09 at 05:31
  • I think the answer is as simple as - the writer didn't know about distinct (which is surprising because I thought he was pro). – mson Jan 09 '09 at 09:59

5 Answers5

29

GROUP BY maps groups of rows to one row, per distinct value in specific columns, which don't even necessarily have to be in the select-list.

SELECT b, c, d FROM table1 GROUP BY a;

This query is legal SQL (correction: only in MySQL; actually it's not standard SQL and not supported by other brands). MySQL accepts it, and it trusts that you know what you're doing, selecting b, c, and d in an unambiguous way because they're functional dependencies of a.

However, Microsoft SQL Server and other brands don't allow this query, because it can't determine the functional dependencies easily. edit: Instead, standard SQL requires you to follow the Single-Value Rule, i.e. every column in the select-list must either be named in the GROUP BY clause or else be an argument to a set function.

Whereas DISTINCT always looks at all columns in the select-list, and only those columns. It's a common misconception that DISTINCT allows you to specify the columns:

SELECT DISTINCT(a), b, c FROM table1;

Despite the parentheses making DISTINCT look like function call, it is not. It's a query option and a distinct value in any of the three fields of the select-list will lead to a distinct row in the query result. One of the expressions in this select-list has parentheses around it, but this won't affect the result.

Bill Karwin
  • 538,548
  • 86
  • 673
  • 828
  • 1
    interesting but irrelevant. you are answering a question about making mistakes in queries. also, i don't think it is ansi standard to accept such a query as you pose. if anything, it's a bug in mysql - the other major vendors do not support this 'feature' – mson Jan 09 '09 at 03:11
  • Getting off-topic, but the second query Bill gives (SELECT DISTINCT(a)...) is perfectly valid ANSI SQL-92. The parentheses are actually irrelevant; you can do SELECT a, (b), c FROM table1, and that's valid. It's just that because it's the first param it LOOKS like it the 'a' is a param of DISTINCT. – Cowan Jan 09 '09 at 04:26
  • 1
    @mson: You're right! I looked it up in my "SQL-99 Complete, Really" and it says enforcement of the Single-Value Rule is standard SQL and an RDBMS that doesn't enforce it (e.g. MySQL) is not conforming to standard SQL. – Bill Karwin Jan 09 '09 at 07:28
  • @Cowan: Yes that was my point. It's valid syntax, and it does exactly what it's supposed to, but that's different from what some people believe it does. – Bill Karwin Jan 09 '09 at 07:32
  • @Bill - thanks for researching and finding out for sure (i have a terrible memory and was guessing). – mson Jan 09 '09 at 09:54
  • 1
    FWIW, SQLite also does not enforce the Single-Value Rule. Also, MySQL will enforce the rule if you set the `ONLY_FULL_GROUP_BY` SQL mode. – Bill Karwin Jan 26 '10 at 01:02
18

A little (VERY little) empirical data from MS SQL Server, on a couple of random tables from our DB.

For the pattern:

SELECT col1, col2 FROM table GROUP BY col1, col2

and

SELECT DISTINCT col1, col2 FROM table 

When there's no covering index for the query, both ways produced the following query plan:

|--Sort(DISTINCT ORDER BY:([table].[col1] ASC, [table].[col2] ASC))
   |--Clustered Index Scan(OBJECT:([db].[dbo].[table].[IX_some_index]))

and when there was a covering index, both produced:

|--Stream Aggregate(GROUP BY:([table].[col1], [table].[col2]))
   |--Index Scan(OBJECT:([db].[dbo].[table].[IX_some_index]), ORDERED FORWARD)

so from that very small sample SQL Server certainly treats both the same.

Cowan
  • 37,227
  • 11
  • 66
  • 65
  • thanks for confirming - i'd guess that the big vendors have optimized this, but i'd also guess that MySQL has not. – mson Jan 09 '09 at 09:57
3

In MySQL I've found using a GROUP BY is often better in performance than DISTINCT.

Doing an "EXPLAIN SELECT DISTINCT" shows "Using where; Using temporary " MySQL will create a temporary table.

vs a "EXPLAIN SELECT a,b, c from T1, T2 where T2.A=T1.A GROUP BY a" just shows "Using where"

karl
  • 307
  • 1
  • 11
  • you always should check the EXPLAIN of your query in question. Today I had a query which was "USING temporary; USING filesort" with group by but only "USING temporary" with DISTINCT. therefore the DISTINCT variant was 4 times faster. (mysql 5.7) – staabm Jul 26 '17 at 10:24
2

Both would generate the same query plan in MS SQL Server.... If you have MS SQL Server you could just enable the actual execution plan to see which one is better for your needs ...

Please have a look at those posts:

http://blog.sqlauthority.com/2007/03/29/sql-server-difference-between-distinct-and-group-by-distinct-vs-group-by/

http://www.sqlmag.com/Article/ArticleID/24282/sql_server_24282.html

Andre Gallo
  • 2,261
  • 5
  • 23
  • 46
  • using group by instead of distinct smells like newb. there is a lot more syntax for no gain. also - you don't have any metrics - just hearsay from another's blog. – mson Jan 09 '09 at 01:48
0

If you really are looking for distinct values, the distinct makes the source code more readable (like if it's part of a stored procedure) If I'm writing ad-hoc queries I'll usually start with the group by, even if I have no aggregations because I'll often end up putting them on.

Booji Boy
  • 4,522
  • 4
  • 40
  • 45
  • i do the same thing you do, but in the other direction. i start with select distinct and move to group by if there is a reason to. i don't try to optimize everything, but if i can remember a few rules of thumb about what is faster and cleaner, i follow them. group by is a bit messy-2 clauses – mson Jan 09 '09 at 09:56