DISTINCT with PARTITION BY vs. GROUPBY

Question

I have found some SQL queries in an application I am examining like this:

SELECT DISTINCT
Company, Warehouse, Item,
SUM(quantity) OVER (PARTITION BY Company, Warehouse, Item) AS stock

I'm quite sure this gives the same result as:

SELECT
Company, Warehouse, Item,
SUM(quantity) AS stock
GROUP BY Company, Warehouse, Item

Is there any benefit (performance, readability, additional flexibility in writing the query, maintainability, etc.) of using the first approach over the later?

As I had to mentally parse the first query for a while, it does not score well with "understandability"... — usr, Dec 04 '13 at 12:14
I think [this question](http://stackoverflow.com/questions/2404565/sql-server-difference-between-partition-by-and-group-by) may help clarify the difference — Squirrel5853, Dec 04 '13 at 12:18
@SecretSquirrel I have seen that question but it did not give me quite the information I was looking for. I would like to know if there might some non obvious (to me at least) benefits for using the first query. The second look to me more "natural" as well. — Andris, Dec 04 '13 at 12:23
What I basically meant was from what I have read and understand is that using `GROUP BY` and `PARTITION BY` are not really interchangeable. They both do different things. I would be wary of simply swapping queries over even when they appear to give the same results. — Squirrel5853, Dec 04 '13 at 12:26
The first one calculates a large number of rows to then trow away many of them. The second one is probably more efficient as the group by can do that "on-the-fly" - but only the execution plan can tell. But I agree with `usr`: the first one scores very low on the "readability" scale. — , Dec 04 '13 at 12:27
The latter is definitely the standard way of creating queries such as this, so I would put readability and flexibility on it in that respect. I'm not sure about the performance but that's rather easy to test, use execution planner and execute both queries at once (default shortcut in SSMS is CTRL + M). That way you can see what happens in each query and which takes up more resources than the other. — Kahn, Dec 04 '13 at 13:46
I think it's important to remember partition is an analytical function not agg and the reason for the distinct. — JustDave, Nov 09 '17 at 01:14

Andris · Accepted Answer · 2013-12-05T07:27:38.533

Performance:

Winner: GROUP BY

Some very rudimentary testing on a large table with unindexed columns showed that at least in my case the two queries generated a completely different query plan. The one for PARTITION BY was significantly slower.

The GROUP BY query plan included only a table scan and aggregation operation while the PARTITION BY plan had two nested loop self-joins. The PARTITION BY took about 2800ms on the second run, the GROUP BY took only 500ms.

Readability / Maintainability:

Winner: GROUP BY

Based on the opinions of the commenters here the PARTITION BY is less readable for most developers so it will be probably also harder to maintain in the future.

Flexibility

Winner: PARTITION BY

PARTITION BY gives you more flexibility in choosing the grouping columns. With GROUP BY you can have only one set of grouping columns for all aggregated columns. With DISTINCT + PARTITION BY you can have different column in each partition. Also on some DBMSs you can chose from more aggregation/analytic functions in the OVER clause.

score 3 · Answer 2 · answered Dec 04 '13 at 12:22

Using sum() as an analytic function with over partition by is not necessary. I don't think there is a big difference between them in any sense. In oracle there are lot more analytic function than aggregation function. I think ms-sql is the same case. And for example lag(), lead(), rank(), dense rank(), etc are much harder to implement with only group by. Of course this argument is not really for defending the first version...

Maybe there were previously more computed fields in the result set which are not implementable with group by.

score 0 · Answer 3 · answered Dec 04 '13 at 13:16

Although both queries seem to compute the same thing when you look at the columns, they are actually producing completely different set of rows.

The first one using the analytical function will output exactly one row for each input row. That is for EACH stock information, it will return a row with the total quantity for the associated company/warehouse/item. (by the way computing the average would make more sense to me but who knows...)

The second one will only return a single row for each company/warehouse/item combinaison.

So yes, in that example the first query seems a bit useless... unless you want to compute some stock level statistic like the current stock ratio over the overall quantity by company/warehouse/item (just an example, don't know if it has any business meaning!)

Analytical function are very powerful mechanism in SQL, in some sense way more powerful than a group-by. But use it with care... A simple rule of thumb could be: if you can compute it using a group-by, well, don't use an analytical function ;)

In the first query there is a `DISTINCT` after `SELECT` so it return only one row for each company/warehouse/item like the second. — Andris, Dec 04 '13 at 13:57
Ok, fair enough... still the DISTINCT is applied to every stock row, and need to take into account each value: company/warehouse/item and sum(quantity). If you look at the exec plan (ok, it may depend on your database) the DISTINCT cost adds on the analytical query, which is already a twice as costly as the simple group-by. — SergeFantino, Dec 04 '13 at 14:27

DISTINCT with PARTITION BY vs. GROUPBY

3 Answers3

Performance:

Readability / Maintainability:

Flexibility

Linked