
I'm looking for the equivalent of MySQL's GROUP_CONCAT() function in SQL Server 2012 - one that DOES NOT USE A SUBQUERY - as explained below:

CREATE TABLE Temp
( 
ID INT PRIMARY KEY NOT NULL IDENTITY(1,1),
ColA varchar(900) NULL,
ColB varchar(900) NULL
)

INSERT INTO Temp (ColA, ColB)
SELECT 'A', 'some' UNION ALL
SELECT 'A', 'thing' UNION ALL
SELECT 'A', 'and' UNION ALL
SELECT 'B', 'some' UNION ALL
SELECT 'B', 'more' UNION ALL
SELECT 'B', 'and' UNION ALL
SELECT 'B', 'more' UNION ALL
SELECT 'C', 'things' UNION ALL
SELECT 'C', 'things'

-- Desired output. Note that each list is deduplicated and ordered by descending frequency within its group ('more' appears twice, so it is listed first)
ColA, Frequency, ColBs
'B', 4, 'more, some, and'
'A', 3, 'some, thing, and'
'C', 2, 'things'

SELECT 
    ColA, 
    COUNT(*) as Frequency, 
    GROUP_CONCAT(ColB) --Would be nice
FROM Temp
GROUP BY ColA
ORDER BY Frequency DESC

The common answer to this in SQL Server is using STUFF() on a subquery. In my case, the performance is simply unacceptable (200 million records, 26 seconds per subquery * 200 million = 164 years).

SELECT 
    ColA, 
    COUNT(*) AS Frequency, 
    ISNULL(
        STUFF((
            SELECT ', ' + ColB
            FROM
                (SELECT ColB, COUNT(*) AS Frequency
                 FROM Temp sub
                 WHERE sub.ColA = t.ColA
                 GROUP BY ColB) grouped
            ORDER BY grouped.Frequency DESC
            FOR XML PATH('')
        ), 1, 2, ''),
    '') AS ColBs --Would take 164 years on the entire data set
FROM Temp t
GROUP BY ColA
ORDER BY Frequency DESC

The desired output is the ColB values for each unique ColA, concatenated per group in descending order of occurrence, as seen above. However, this needs to be done in a SINGLE QUERY, i.e. one pass through the table.

Do I need to construct this myself and abandon the 'GROUP BY' call? Iterate through the data set manually and build the new table in a console application? Or is there something I'm missing?

Ehryk
  • You could try using a cursor-based solution (`LOCAL FAST_FORWARD`); see the sketch after this comment thread. – Bogdan Sahlean Nov 28 '13 at 23:10
  • Could you provide an answer with an example? – Ehryk Nov 28 '13 at 23:13
  • possible duplicate of [How to use GROUP BY to concatenate strings in SQL Server?](http://stackoverflow.com/questions/273238/how-to-use-group-by-to-concatenate-strings-in-sql-server) – Stoleg Nov 29 '13 at 10:36
  • Your answer is here: http://stackoverflow.com/questions/273238/how-to-use-group-by-to-concatenate-strings-in-sql-server – Stoleg Nov 29 '13 at 10:36
  • How do you figure my answer is there? I stated explicitly that I cannot use a subquery due to performance reasons, and every answer there uses one. – Ehryk Nov 29 '13 at 15:03
  • This might not be the answer you're hoping for but if you're dealing with a dataset large enough to have 200M records _after_ a `GROUP BY` I'd be tempted to say your architecture needs a change. Depending on your setup, that might be caching/aggregating results in a different table every n hours, perhaps offloading the work to Reporting Services or even going whole-hog and using OLAP cubes. Can you provide a little more detail about the underlying problem (how often data changes, what this is used for, how up-to-date lists have to be, etc) to allow us to find the best solution for you? – Basic Dec 05 '13 at 08:25
  • Oh... And as a sanity check... If you omit Grouping entirely and simply select all the values from `ColB` that match your initial WHERE clause, how long does it take (to execute the query)? A Grouped concat can never be faster than that as it needs to do processing after selecting all the appropriate rows. I'd also suggest you look at the Execution Plan to see where all that time is going. Are there table Scans instead of seeks? Are your indexes configured correctly on `ColA`? etc – Basic Dec 05 '13 at 08:34
  • It's 56M after the `GROUP BY`, ~200M before. Not my architecture, but it is static. I can't add any indices as the transaction log makes my SSD run out of drive space before they complete, and it's 5x slower on a HDD. – Ehryk Dec 05 '13 at 10:27
  • @Stoleg I do not think this question is a duplicate because I specifically asked for a solution that does not use any subqueries. All of the answers in the 'Possible Duplicate' use subqueries. – Ehryk Dec 05 '13 at 11:02
  • Is there a difference for you between CTE and subquery? – Stoleg Dec 05 '13 at 11:59
  • I don't know, honestly. The answer below is using CTEs, and the answers in the question you linked to are using subqueries. I'm looking for an aggregate function that does this (like MySQL's GROUP_CONCAT) or a CLR library that adds such an ability if it cannot be written in SQL Server itself. – Ehryk Dec 05 '13 at 18:56
  • You can write your own custom CLR aggregate. If you search, you will probably find an existing implementation; a registration sketch follows below. – Martin Smith Mar 01 '14 at 01:26
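
For reference, a minimal sketch of the cursor approach Bogdan suggests above: one ordered pass over the pre-aggregated (ColA, ColB) pairs, appending to the current group's string and emitting a row whenever ColA changes. The #Results table and the variable names are illustrative, not from the question:

-- Sketch of the LOCAL FAST_FORWARD cursor approach suggested above.
-- Assumes the Temp schema from the question and no NULL ColA values.
CREATE TABLE #Results (ColA varchar(900), Frequency int, ColBs varchar(MAX))

DECLARE @ColA varchar(900), @ColB varchar(900), @Recs int
DECLARE @CurA varchar(900) = NULL, @Freq int = 0, @List varchar(MAX) = ''

DECLARE cur CURSOR LOCAL FAST_FORWARD FOR
    SELECT ColA, ColB, COUNT(*) AS Recs
    FROM Temp
    GROUP BY ColA, ColB                -- collapses duplicates; Recs keeps the count
    ORDER BY ColA, COUNT(*) DESC, ColB -- each group arrives in descending frequency

OPEN cur
FETCH NEXT FROM cur INTO @ColA, @ColB, @Recs
WHILE @@FETCH_STATUS = 0
BEGIN
    IF @CurA IS NOT NULL AND @ColA <> @CurA
    BEGIN
        -- ColA changed: emit the finished group, then reset the accumulators
        INSERT INTO #Results VALUES (@CurA, @Freq, @List)
        SELECT @Freq = 0, @List = ''
    END
    SELECT @CurA = @ColA,
           @Freq = @Freq + @Recs,
           @List = CASE WHEN @List = '' THEN @ColB ELSE @List + ', ' + @ColB END
    FETCH NEXT FROM cur INTO @ColA, @ColB, @Recs
END
IF @CurA IS NOT NULL
    INSERT INTO #Results VALUES (@CurA, @Freq, @List) -- flush the last group
CLOSE cur
DEALLOCATE cur

SELECT ColA, Frequency, ColBs FROM #Results ORDER BY Frequency DESC

And on the CLR route mentioned in the last two comments: once a user-defined aggregate is compiled into an assembly, wiring it up is plain T-SQL. A sketch, where the DLL path, assembly name, and class name are placeholders for whatever implementation you build or find:

-- Sketch of registering a hypothetical SQLCLR concatenation aggregate.
-- The path, assembly name, and class name below are placeholders.
CREATE ASSEMBLY GroupConcatAsm
FROM 'C:\clr\GroupConcat.dll'
WITH PERMISSION_SET = SAFE
GO

CREATE AGGREGATE dbo.GROUP_CONCAT (@value nvarchar(4000))
RETURNS nvarchar(MAX)
EXTERNAL NAME GroupConcatAsm.[GroupConcat]
GO

-- The original query then works as a true single-pass aggregate
-- (user-defined aggregates must be called with a two-part name):
SELECT ColA, COUNT(*) AS Frequency, dbo.GROUP_CONCAT(ColB) AS ColBs
FROM Temp
GROUP BY ColA
ORDER BY Frequency DESC

Note that ordering and de-duplication of the list would have to happen inside the CLR code itself, since SQL Server 2012 offers no ORDER BY clause for user-defined aggregate calls.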

1 Answer


Try this:

WITH prelim
AS
(
    -- One row per (cola, colb) pair: recs = occurrences of the pair,
    -- recno = rank of colb within its cola by descending frequency
    -- (ties broken alphabetically), cnt = distinct colb values per cola,
    -- which is also the recursion depth needed for that cola.
    SELECT
         cola
        ,colb
        ,COUNT(*) AS recs
        ,ROW_NUMBER() OVER (PARTITION BY cola ORDER BY COUNT(*) DESC, colb) AS recno
        ,COUNT(*) OVER (PARTITION BY cola) AS cnt
    FROM Temp
    GROUP BY cola, colb
),
Group_Concat (recno, cnt, recs, cola, colbs)
AS
(
    -- Anchor: start each cola's list with its most frequent colb
    SELECT
         recno
        ,cnt
        ,recs
        ,cola
        ,CAST(colb AS varchar(MAX)) AS colbs
    FROM prelim
    WHERE recno = 1
    UNION ALL
    -- Recursive step: append the next-ranked colb, accumulating the row count
    SELECT
         p.recno
        ,p.cnt
        ,g.recs + p.recs
        ,p.cola
        ,g.colbs + ', ' + CAST(p.colb AS varchar(MAX)) AS colbs
    FROM prelim p
    JOIN Group_Concat g ON p.cola = g.cola AND p.recno = g.recno + 1
)

SELECT cola, recs AS Frequency, colbs
FROM Group_Concat
WHERE recno = cnt -- keep only the fully concatenated row for each cola
ORDER BY Frequency DESC
OPTION (MAXRECURSION 0) -- groups may exceed the default limit of 100 recursion levels
conan
  • This is working on the given data set, I'll try it on my larger one and if it takes < a few days, I'll mark this as the answer. Thanks! – Ehryk Nov 29 '13 at 15:09
  • If performance is still a problem, the only thing I can think of is to replace the prelim CTE with a @table variable with a clustered index on cola and recno (see the sketch after this comment thread). Performance will be very dependent on how much the first GROUP BY reduces the overall record count. Posting from phone so can't post SQL atm. Will do so later. – conan Nov 29 '13 at 16:38
  • Doesn't this count as using a subquery in the context of the OP? – Basic Dec 05 '13 at 08:37
  • It does, I was going to check if it was any faster. If it is (can complete in less than a day), I'll still mark it as the answer if no attempts are made to do it without a subquery. I don't know why concatenating aggregate functions aren't built in. – Ehryk Dec 05 '13 at 10:30
  • Technically though, it uses an *additional* query, not a subquery for each row. – Ehryk Dec 05 '13 at 11:02
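
Following up on the table-variable idea in the comments, a minimal sketch of that variant, assuming the Temp schema from the question: prelim is materialized once into a table variable clustered on (cola, recno), so each recursion level can seek on the key instead of re-running the aggregation.

-- Sketch only; untested at scale. Assumes no NULL ColA values, as in the
-- sample data, since the primary key columns must be NOT NULL.
DECLARE @prelim TABLE
(
    cola  varchar(900) NOT NULL,
    colb  varchar(900) NULL,
    recs  int NOT NULL,
    recno int NOT NULL,
    cnt   int NOT NULL,
    PRIMARY KEY CLUSTERED (cola, recno) -- note the 900-byte clustered key limit;
                                        -- narrow cola if real values are long
)

INSERT INTO @prelim (cola, colb, recs, recno, cnt)
SELECT
     cola
    ,colb
    ,COUNT(*)
    ,ROW_NUMBER() OVER (PARTITION BY cola ORDER BY COUNT(*) DESC, colb)
    ,COUNT(*) OVER (PARTITION BY cola)
FROM Temp
GROUP BY cola, colb

;WITH Group_Concat (recno, cnt, recs, cola, colbs)
AS
(
    SELECT recno, cnt, recs, cola, CAST(colb AS varchar(MAX))
    FROM @prelim
    WHERE recno = 1
    UNION ALL
    SELECT p.recno, p.cnt, g.recs + p.recs, p.cola,
           g.colbs + ', ' + CAST(p.colb AS varchar(MAX))
    FROM @prelim p
    JOIN Group_Concat g ON p.cola = g.cola AND p.recno = g.recno + 1
)
SELECT cola, recs AS Frequency, colbs
FROM Group_Concat
WHERE recno = cnt
ORDER BY Frequency DESC
OPTION (MAXRECURSION 0)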