
I'm using Spark version 3.2.1 on Databricks (DBR 10.4 LTS), and I'm trying to convert a SQL Server query into a Databricks SQL query that runs on a Spark cluster using Spark SQL. However, Spark SQL does not seem to support FOR XML PATH as a SQL function, and I wonder if there is an alternative way to express this SQL Server query so that Spark SQL will accept and run it. The original SQL Server query looks like this:

%sql
DROP TABLE if exists UserCountry;
CREATE TABLE if not exists UserCountry (
  UserID INT,
  Country VARCHAR(5000)
);
INSERT INTO UserCountry
SELECT 
    L.UserID AS UserID,
    COALESCE(
        STUFF(
            (SELECT ', ' + LC.Country FROM UserStopCountry LC WHERE L.UserID = LC.UserID FOR XML PATH (''))
            , 1, 2, '')
        , '') AS Country
FROM 
LK_ETLRunUserID L

When I run the query above in Databricks Spark SQL, I get the following error:

ParseException: 
mismatched input 'FOR' expecting {')', '.', '[', 'AND', 'BETWEEN', 'CLUSTER', 'DISTRIBUTE', 'DIV', 'EXCEPT', 'GROUP', 'HAVING', 'IN', 'INTERSECT', 'IS', 'LIKE', 'ILIKE', 'LIMIT', NOT, 'OR', 'ORDER', 'QUALIFY', RLIKE, 'MINUS', 'SORT', 'UNION', 'WINDOW', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', '&', '|', '||', '^', ':', '::'}(line 6, pos 80)

== SQL ==
INSERT INTO UserCountry
SELECT 
    L.UserID AS UserID,
    COALESCE( 
        STUFF( 
            (SELECT ', ' + LC.Country FROM UserStopCountry LC WHERE L.UserID = LC.UserID FOR XML PATH ('')) 
--------------------------------------------------------------------^^^
            , 1, 2, '')
        , '') AS Country
FROM 
LK_ETLRunUserID L

Given that the UserStopCountry table looks like this:

UserID Country
1 'US'
1 'US'
2 'MEXICO'
2 'US'
3 'US'

I believe the output will be:

UserID Country
1 'US', 'US'
2 'MEXICO', 'US'
3 'US'
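
From what I've read, Spark SQL may support this kind of string aggregation through `collect_list` and `concat_ws` instead of FOR XML PATH and STUFF; a sketch of the rewrite (untested, assuming the same table names as above) might look like this:

```sql
-- Sketch of a possible Spark SQL equivalent, using the table names from the post.
-- collect_list gathers all Country values per UserID (duplicates kept), and
-- concat_ws joins them with ', '. No STUFF is needed because concat_ws adds
-- no leading separator; COALESCE keeps users with no countries as ''.
-- Note: collect_list does not guarantee element order.
INSERT INTO UserCountry
SELECT
    L.UserID AS UserID,
    COALESCE(A.Country, '') AS Country
FROM LK_ETLRunUserID L
LEFT JOIN (
    SELECT UserID, concat_ws(', ', collect_list(Country)) AS Country
    FROM UserStopCountry
    GROUP BY UserID
) A ON L.UserID = A.UserID;
```
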
    Are you not on a recent version of SQL Server? Why not use `STRING_AGG`, as all supported versions of SQL Server support the syntax. – Thom A Mar 27 '23 at 13:51
  • Hmm, does Databricks SQL accept the STRING_AGG function? Do you happen to know how STRING_AGG could be used here to replace FOR XML PATH? I believe the SQL Server query I shared in the post was written a few years ago, when STRING_AGG was not available. – user9532692 Mar 27 '23 at 14:08
  • `STRING_AGG` was added in SQL Server 2017, which is why I asked if you are not using a recent version of SQL Server. It's been around, now, for 6 years. – Thom A Mar 27 '23 at 14:09
  • Would you mind sharing how `STRING_AGG` could be used here to replace FOR XML PATH? – user9532692 Mar 27 '23 at 14:14
  • What version of SQL Server are you actually using? – Thom A Mar 27 '23 at 14:15
  • Well, I'm not actually using SQL Server at the moment. I'm trying to convert a SQL Server query into a Spark SQL query, and I'm not sure if Spark SQL will accept STRING_AGG to be honest, but I might as well give it a try. So to your question, I think we could try version 15.0, which was released around 2019. – user9532692 Mar 27 '23 at 15:23
  • That would explain why what you have isn't working. I'm pretty sure `FOR XML PATH` is proprietary to SQL Server. You can't just take SQL written in one dialect, for one database engine, and dump it into another and expect it to "just work". You should be searching for how to do string aggregation in the product you are using. – Thom A Mar 27 '23 at 15:37
  • Yes, that's what I was trying to ask in my post – how to use something like "FOR XML PATH" or "STRING_AGG" in Spark SQL. I apologize if that wasn't clear in my post. – user9532692 Mar 27 '23 at 16:48
