How to implement collect_set functionality from pyspark into SQL Server?

Question

I have a column that I am generating using the STRING_AGG function.

STRING_AGG(CONVERT(NVARCHAR(MAX), ISNULL(T1.id,'N/A')), ',') AS old_id_list

This returns a list of all the values in group by aggregrated into a list.

My original table looks like this:

new_id	old_id	amount
a	1	10
a	1	20
a	2	30
a	2	40
a	3	50

On applying the above string_agg call, I get an output like this:

new_id	old_id_list	amount_total
a	1,1,2,2,3	150

But I want to remove the repeated id while not disturbing the total amount computed column.

Expected output:

new_id	old_id_list	amount_total
a	1,2,3	150

Things I have found over the internet were using distinct and ARRAY_AGG function but SQL Server does not have ARRAY_AGG function. I cannot remove the repeated old_id before the string_agg() as it will change the total amount computation.

I tried to insert distinct keyword into the string_agg function but it didn't work either.

'ARRAY_AGG' is not a recognized built-in function name.

TLDR: I am trying to implement collect_set() functionality from pyspark in SQL Server.

https://spark.apache.org/docs/3.2.0/api/python/reference/api/pyspark.sql.functions.collect_set.html

I'm using SQL Server 2019 (v15.0.2095.3)

[Please do not upload images of code/data/errors when asking a question.](//meta.stackoverflow.com/q/285551) — Thom A, Jan 23 '23 at 17:28
Also, the version of SSMS you are using is some what irrelevant. Using SSMS 19 means you are could be using anything including SQL Server 2008-2022 (That's 8 different versions alone), Azure SQL Edge, Azure SQL Database, Azure Managed Instance, Azure Synapse, and I'm probably missing something else. The (R)DBMS product and versionis what's important, not the IDE environment you are using. — Thom A, Jan 23 '23 at 17:30
The function *isn't* called `ARRAY_AGG`, it's called `STRING_AGG`... The error is telling you the problem; there *isn't* built-in function called `ARRAY_AGG`. — Thom A, Jan 23 '23 at 17:32
@Larnu I was using https://stackoverflow.com/questions/57571206/sql-distinct-values-per-group-how-to-group-by-and-get-a-list-of-distinct-val as a reference and https://learn.microsoft.com/en-us/u-sql/functions/aggregate/array-agg . I am new to SQL and exploring stuff. I addressed the issue regarding SQL server version. Thank you — Sagar Moghe, Jan 23 '23 at 17:36
The documentation you've linked is for U-SQL, not T-SQL. U-SQL is used by Azure Data Lake Analytics service(s). The *question* you've linked is for Google BigQuery; a completely different product made by a completely different vendor. — Thom A, Jan 23 '23 at 17:39
@Larnu, I agree I just stated the approaches I tried till now. If you see the second line of the question I have used the STRING_AGG function. I just posted them so that it does not seem I am asking question without trying anything first. When I was first trying it I didnt know the difference between T_SQL and U-SQL. I just searched for MS_SQL and tried the pieces of code that looked familiar. — Sagar Moghe, Jan 23 '23 at 17:45
Does this answer your question? [Get unique values using STRING\_AGG in SQL Server](https://stackoverflow.com/questions/50589064/get-unique-values-using-string-agg-in-sql-server) — Thom A, Jan 23 '23 at 17:46
@Larnu, I visited this answer as well, It mentions removing duplicates before performing the string_agg function. If I do that then the total amount aggregration will fail as even if the column has duplicated id it has a valid amount that I would still like to keep. — Sagar Moghe, Jan 23 '23 at 17:55
So instead of `DISTINCT`, aggregate into the distinct groups. — Thom A, Jan 23 '23 at 17:57
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/251345/discussion-between-sagar-moghe-and-larnu). — Sagar Moghe, Jan 23 '23 at 17:58
@Larnu, it will be difficult for me to perform this as there are other order_id and invoice_id that have to go through same procedure that are other hidden columns in this table view. I am looking for a collect_set() type of functionality if it exist. — Sagar Moghe, Jan 23 '23 at 18:05
You can use cross-apply to return a distinct list from which to aggregate. — Stu, Jan 23 '23 at 18:07

score 0 · Answer 1 · answered Jan 23 '23 at 20:05

As I mentioned, pre-aggregate into distinct groups, and then string aggregate:

SELECT new_id,
       STRING_AGG(old_id,',') AS old_id_list,
       SUM(amount) AS amount_total
FROM (SELECT new_id,
             old_id,
             SUM(amount) AS amount
      FROM dbo.YourTable
      GROUP BY new_id,
               old_id) YT
GROUP BY new_id;

How to implement collect_set functionality from pyspark into SQL Server?

1 Answers1