SQL: Aggregate DISTINCT vs. GROUP BY Columns

Question

Apologies if there is a duplicate question. I couldn't find one.

This is in SQL Server (Azure Synapse)

I have a table with 150+ columns and I am trying to get an aggregate COUNT( DISTINCT ) over a few of them.

Here is (a subset of) my query:

SELECT *
    ,count(DISTINCT XXX) AS Invoice_Count
    ,count(DISTINCT YYY) AS PO_Count
    ,count(DISTINCT ZZZ) AS PO_Item_Count
    ,'INV_WT_PO' AS Source_Identifier
    ,concat(XXX, YYY) AS Reference_Key
FROM [XXX].[YYY]

But I am getting the usual error:

XXX is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

I understand the obvious solution would be to mention all the 150+ columns in the GROUP BY clause.

I did that and it worked but the performance is ABYSMAL. 1 minute 40 seconds vs. under 1 second for the query without the aggregate columns and GROUP BY clause.

I am curious if there is a better way to address this?

Thanks.

Your current query returns a single row summary over the _entire_ table. Adding `concat(XXX, YYY)` to the select clause makes no sense. I suggest adding sample data to your question. — Tim Biegeleisen, Jul 05 '23 at 02:40
Could you please provide sample data instead of XXX and YYY? — Bhavani, Jul 05 '23 at 03:29
you can use [OVER clause](https://learn.microsoft.com/en-us/sql/t-sql/queries/select-over-clause-transact-sql?view=sql-server-ver16). But you can do `COUNT(DISTINCT ) OVER( ) `, see [work around](https://stackoverflow.com/questions/11202878/partition-function-count-over-possible-using-distinct) — Squirrel, Jul 05 '23 at 03:41
Take out the asterisk as the first step. `select *, count(distinct xxx) ...` means you are attempting to display all 150 columns AND do some aggregation at the same time. Aggregation implies summarization, is 150 raw columns a summary? — Paul Maxwell, Jul 05 '23 at 04:23
Count distinct is a heavy operation because it builds a lot of table spools. If youre doing it mechanically for 150 columns, you're probably doing something wrong. For example, you can instead preaggregate your data by using the regular count before joining all other tables which forces you to use a distinct count. A classic example is when you build an invoice of head and rows. One can count the head data first and then join rows for whatever u need — siggemannen, Jul 05 '23 at 06:59

Bhavani · Answer 1 · 2023-07-05T08:57:18.010

0

I replicated the issue in my environment with sample data.

I have run the given code according to my data I got the same error.

enter image description here

I tried with below code without using group by clause.

SELECT *,
       (SELECT COUNT(DISTINCT XXX) FROM [XXX].[YYY]) AS Invoice_Count,
       (SELECT COUNT(DISTINCT YYY) FROM [XXX].[YYY]) AS Age_Count,
       (SELECT COUNT(DISTINCT ZZZ) FROM [XXX].[YYY]) AS Name_Count,
       'INV_WT_PO' AS Source_Identifier,
       CONCAT(XXX, YYY) AS Reference_Key
FROM [XXX].[YYY]

Output:

enter image description here

If you are having more than 150 columns as @Squirrel said you can try with COUNT(DISTINCT ) OVER( ).

edited Jul 05 '23 at 08:57

answered Jul 05 '23 at 04:37

Bhavani

1,725
1
3
6

As per the question guide, please do not post images of code, data, error messages, etc. - copy or type the text into the question. Please reserve the use of images for diagrams or demonstrating rendering bugs, things that are impossible to describe accurately via text. – Dale K Jul 05 '23 at 05:24
How do you recon that is going to perform for 150+ sub-queries? Surely window functions would be a better choice? – Dale K Jul 05 '23 at 05:25
@Dale K I Edited the post according to your suggestion and added detail explanations of execution. – Bhavani Jul 05 '23 at 08:28

SQL: Aggregate DISTINCT vs. GROUP BY Columns

1 Answers1