18

I have the following query:

WITH cteCountryLanguageMapping AS (
    SELECT * FROM (
        VALUES
            ('Spain', 'English'),
            ('Spain', 'Spanish'),
            ('Sweden', 'English'),
            ('Switzerland', 'English'),
            ('Switzerland', 'French'),
            ('Switzerland', 'German'),
            ('Switzerland', 'Italian')
    ) x ([Country], [Language])
)
SELECT
    [Country],
    CASE COUNT([Language])
        WHEN 1 THEN MAX([Language])
        WHEN 2 THEN STRING_AGG([Language], ' and ')
        ELSE STRING_AGG([Language], ', ')
    END AS [Languages],
    COUNT([Language]) AS [LanguageCount]
FROM cteCountryLanguageMapping
GROUP BY [Country]

I was expecting the value inside Languages column for Switzerland to be comma separated i.e.:

  | Country     | Languages                                 | LanguageCount
--+-------------+-------------------------------------------+--------------
1 | Spain       | Spanish and English                       | 2
2 | Sweden      | English                                   | 1
3 | Switzerland | French, German, Italian, English          | 4

Instead I am getting the below output (the 4 values are separated by and):

  | Country     | Languages                                 | LanguageCount
--+-------------+-------------------------------------------+--------------
1 | Spain       | Spanish and English                       | 2
2 | Sweden      | English                                   | 1
3 | Switzerland | French and German and Italian and English | 4

What am I missing?


Here is another example:

SELECT y, STRING_AGG(z, '+') AS STRING_AGG_PLUS, STRING_AGG(z, '-') AS STRING_AGG_MINUS
FROM (
    VALUES
        (1, 'a'),
        (1, 'b')
) x (y, z)
GROUP by y

  | y | STRING_AGG_PLUS | STRING_AGG_MINUS
--+---+-----------------+-----------------
1 | 1 | a+b             | a+b

Is this a bug in SQL Server?

Salman A
  • 262,204
  • 82
  • 430
  • 521
Tom Hunter
  • 5,714
  • 10
  • 51
  • 76
  • 1
    Seems to be a bug, it's always returning the 1st STRING_AGG no matter how you write the CASE – dnoeth Sep 27 '18 at 09:23
  • 1
    This is a beauty of an optimizer bug. Simpler and more striking repro: `CASE COUNT([Language]) WHEN 1234567 THEN STRING_AGG([Language], ' and ') ELSE STRING_AGG([Language], ', ') END AS [Languages]` (uses the `1234567` case) and `CASE COUNT([Language]) WHEN 1234567 THEN STRING_AGG([Language], ' and ') END AS [Languages]` (leave out `ELSE` -- now the match fails and the expression becomes `NULL`). No matter what the "correct" result should be, surely that's not it. – Jeroen Mostert Sep 27 '18 at 09:25
  • 1
    Not sure this is appropriate, but....LOL! Since this is not a lack of knowledge on your part, rather than a real thing that many would come across, I strongly recommend changing the question title to be more accurate than a generic "not working as intended", in order to maximize benefit for all. – George Menoutis Sep 27 '18 at 09:35
  • In the resulting execution plan, the second `STRING_AGG` is entirely missing and instead the expression is rebound to the first one, as if the `CASE` had said `STRING_AGG([Language], ' and ')` twice. Any subsequent `CASE`s are absorbed as well. Looks like something very weird is going on with subexpression elimination. – Jeroen Mostert Sep 27 '18 at 09:35
  • 1
    This bug seems to be specifically tuned to `STRING_AGG`. If the `ELSE` is changed to `'blargh' + STRING_AGG(...)`, you'll get `'blarghFrench and German...`, so it improperly unifies the second `STRING_AGG` with the first. The simplest workaround is to change the `ELSE` expression to `STRING_AGG([Language] + '', ', ')` -- this defeats CSE, suggesting there's a bug where CSE ignores the second argument to `STRING_AGG`. – Jeroen Mostert Sep 27 '18 at 09:42
  • @JeroenMostert That workaround sounds like it would make a good answer. – IMSoP Sep 27 '18 at 09:43
  • I'd like to add that the query works as expected in Azure SQL Database so hopefully a fix for the on-prem is in the works. – Dan Guzman Sep 27 '18 at 09:48

1 Answers1

20

Yes, this is a Bug (tm), present in all versions of SQL Server 2017 (as of writing). It's fixed in Azure SQL Server and 2019 RC1. Specifically, the part in the optimizer that performs common subexpression elimination (ensuring that we don't calculate expressions more than necessary) improperly considers all expressions of the form STRING_AGG(x, <separator>) identical as long as x matches, no matter what <separator> is, and unifies these with the first calculated expression in the query.

One workaround is to make sure x does not match by performing some sort of (near-)identity transformation on it. Since we're dealing with strings, concatenating an empty one will do:

SELECT y, STRING_AGG(z, '+') AS STRING_AGG_PLUS, STRING_AGG('' + z, '-') AS STRING_AGG_MINUS
FROM (
    VALUES
        (1, 'a'),
        (1, 'b')
) x (y, z)
GROUP by y
Jeroen Mostert
  • 27,176
  • 2
  • 52
  • 85
  • I allowed myself to post this workaround on [Azure Feedback](https://feedback.azure.com/forums/908035-sql-server/suggestions/35803012-string-agg-multiple-string-agg-columns-defer-to) – Lukasz Szozda Feb 10 '19 at 19:16
  • https://dba.stackexchange.com/a/257819/23463 indicates it is still broken, even in CU17. – Ross Presser Jan 21 '20 at 21:32
  • 1
    @RossPresser: retested it and sure enough, CU17 doesn't actually fix this. Amended the answer. – Jeroen Mostert Jan 22 '20 at 12:01