1

This follows on from Passing a varchar full of comma delimited values to a SQL Server IN function.

I want to split some comma-separated text but I need to allow for embedded commas:

  DECLARE @text NVARCHAR(1000) = 'abc,def,"ghi,jkl",mno';

The results I'm expecting are:

  abc
  def
  ghi,jkl
  mno

Here's the function I use to split CSV text.

It uses a loop so if performance is an issue you can adapt it using the suggestions here: https://stackoverflow.com/a/878964/482595

CREATE FUNCTION uf_Split
( 
    @Text NVARCHAR(MAX), 
    @Delimiter CHAR(1),
    @Quote CHAR(1)
) 
RETURNS @Result TABLE 
( 
    [Index] INT NOT NULL IDENTITY(1, 1), 
    [Value] NVARCHAR(4000) NULL,
    [CharPos] INT
)
AS 
BEGIN 
    DECLARE @start BIGINT; SET @start = 1
    DECLARE @end BIGINT; SET @end = 1

    IF @Text is null
    BEGIN
      RETURN
    END 

    WHILE 1=1 
    BEGIN 
        SET @end = 
            CASE
                WHEN CHARINDEX(@Quote, @Text, @start) = @start THEN CHARINDEX(@Quote + @Delimiter, @Text, @start + 1)
                ELSE CHARINDEX(@Delimiter, @Text, @start)
            END

        IF ISNULL(@end, 0) = 0 
        BEGIN 
            -- Delimiter could not be found in the remainder of the text:
            INSERT @Result([Value], [CharPos]) VALUES(SUBSTRING(@Text, @start, DATALENGTH(@Text)), @start)
            BREAK 
        END 
        ELSE IF (CHARINDEX(@Quote, @Text, @start) = @start) AND (CHARINDEX(@Quote + @Delimiter, @Text, @start + 1) = @end)
        BEGIN
            INSERT @Result([Value], [CharPos]) VALUES(SUBSTRING(@Text, @start + 1, @end - @start - 1), @start)
            SET @start = @end + 2
        END
        ELSE 
        BEGIN
            INSERT @Result([Value], [CharPos]) VALUES(SUBSTRING(@Text, @start, @end - @start), @start)
            SET @start = @end + 1
        END
    END 

    RETURN
END
GO
Community
  • 1
  • 1
sean
  • 105
  • 1
  • 6
  • That's a non-trivial proposition in most languages. It is extremely hard to do with regular expressions. You are normally best of with code/functions specifically designed to handle splitting CSV format. It may be doable; but it probably won't be fun. – Jonathan Leffler May 23 '12 at 04:35

3 Answers3

4

I took a shot using a two-phase split in t-sql. I am definitely interested to see how others approach this one. If these strings are large or if you wish to process large rowsets I would look into other options, perhaps BULK INSERT or CLR.

declare @data nvarchar(1000) = 'abc,def,"ghi,jkl",mno,"yak","yak,123"';


declare @x xml;
select  @x = cast('<d>' + replace(@data, '"', '</d><d>') + '</d>' as xml);

;with c(d,i)
as  (   select  p.n.value('.', 'nvarchar(max)') AS data,
                case
                    when left(p.n.value('.', 'nvarchar(max)'), 1) = ',' then 1
                    when right(p.n.value('.', 'nvarchar(max)'), 1) = ',' then 1 
                    else 0 
                end
        from    @x.nodes('/d') p(n)
    )       
select  d
from    c
where   i = 0 and len(d) > 0
union all
select  p.n.value('.', 'nvarchar(max)')
from    (   select cast('<d>' + replace(d, ',', '</d><d>') + '</d>' as xml)
            from c
            where   i=1
        ) d(x)
cross
apply   d.x.nodes('/d')p(n)
where   len(p.n.value('.', 'nvarchar(max)')) > 0;
nathan_jr
  • 9,092
  • 3
  • 40
  • 55
  • This solution actually performs much better than my solution - splits 17,000 items in 5 sec. – sean May 23 '12 at 22:58
  • If you swapped the XML approach I used for a method based on a number table (like, in your link) you would likely get even better performance. – nathan_jr May 23 '12 at 23:00
0

the best way to do this to define special case for embedded commas in your function ,when you are splitting check for embedded commas in start of string and remove that substring.

Buzz
  • 6,030
  • 4
  • 33
  • 47
0

Great solution, thanks for sharing! I did make one change to fit my situation. Some of our data has embedded &, which causes an illegal character error. To overcome the issue, I used a REPLACE function to change it from & to &amp; so it could be parsed in the XML. Then later when I need to change it back I replace the value with &amp; with &. I'm sure there's a better way to do it but definitely addressed our issue. Below is you code sample with my changes.

DECLARE @data NVARCHAR(MAX) = 'abc,def,"ghi,jkl",mno,"yak","yak,123","zzzz & yyyy"';

SELECT @data = REPLACE(REPLACE(REPLACE(@data, '&', '&amp;'), '<', '&lt;'), '>', '&gt;'); /**** Replace characters (&, <, >) ****/

DECLARE @x XML = CAST('<d>' + REPLACE(@data, '"', '</d><d>') + '</d>' AS XML);

;WITH c (d,i) AS
    (
        SELECT  p.n.value('.', 'NVARCHAR(MAX)') AS DATA,
                CASE
                    WHEN LEFT(p.n.value('.', 'NVARCHAR(MAX)'), 1) = ','
                        THEN 1
                    WHEN RIGHT(p.n.value('.', 'NVARCHAR(MAX)'), 1) = ','
                        THEN 1 
                    ELSE 0 
                END
        FROM @x.nodes('/d') p(n)
    )

SELECT REPLACE(REPLACE(REPLACE([z].[d], '&amp;', '&'), '&lt;', '<'), '&gt;', '>') AS d /**** Restore characters (&, <, >) ****/
FROM (
        SELECT d
        FROM c
        WHERE i = 0
            AND LEN(d) > 0

        UNION ALL

        SELECT p.n.value('.', 'NVARCHAR(MAX)')
        FROM (
                SELECT cast('<d>' + replace(d, ',', '</d><d>') + '</d>' AS XML)
                FROM c
                WHERE i = 1
            ) d(x)
        CROSS APPLY d.x.nodes('/d')p(n)
        WHERE LEN(p.n.value('.', 'NVARCHAR(MAX)')) > 0
    ) AS z
ORDER BY d;
j3ff
  • 5,719
  • 8
  • 38
  • 51