Sort two csv fields by removing duplicates and without row-by-row processing

Question

I am trying to combine two csv fields, eliminate duplicates, sort and store it in a new field.

I was able to achieve this. However, I encountered a scenario where the values are like abc and abc*. I need to keep the one with abc* and remove the other. Could this be achieved without row by row processing?

Here is what I have.

CREATE TABLE csv_test
    (
        Col1 VARCHAR(100),
        Col2 VARCHAR(100),
        Col3 VARCHAR(500)
    );

INSERT dbo.csv_test (Col1, Col2)
VALUES ('xyz,def,abc', 'abc*,tuv,def,xyz*,abc'), ('qwe,bca,a23', 'qwe,bca,a23*,abc')
--It is assumed that there are no spaces around commas

SELECT Col1, Col2, Col1 + ',' + Col2 AS Combined_NonUnique_Unsorted, 
STUFF((
                 SELECT  ',' + Item
                 FROM     (SELECT DISTINCT Item FROM dbo.DelimitedSplit8K(Col1 + ',' + Col2,',')) t
                 ORDER BY Item
                 FOR XML PATH('')
             ),1,1,'') Combined_Unique_Sorted
, ExpectedResult = 'Keep the one with * and make it unique'
FROM   dbo.csv_test;

--Expected Results; if there are values like abc and abc* ; I need to keep abc* and remove abc ; 
--How can I achieve this without looping or using temp tables?

abc,abc*,def,tuv,xyz,xyz*   ->  abc*,def,tuv,xyz*
a23,a23*,abc,bca,qwe        ->  a23*,abc,bca,qwe

Don't. Read [Is storing a delimited list in a database column really that bad?](http://stackoverflow.com/questions/3653462/is-storing-a-delimited-list-in-a-database-column-really-that-bad), where you will see a lot of reasons why the answer to this question is **Absolutely yes!** You need to normalize your database to get rid if this atrocity. — Zohar Peled, Feb 21 '18 at 07:09
Agreed. The new design considers normalization. However, until then we are stuck at this. So, just trying to help a colleague on this and myself to learn some tricky things here. — Santhoshkumar KB, Feb 21 '18 at 07:30

Zohar Peled · Accepted Answer · 2018-02-21T08:50:02.320

Well, since you agree that normalizing the database is the correct thing to do, I decided to try to come up with a solution for you.
I ended up with quite a cumbersome solution involving 4(!) common table expressions - cumbersome, but it works.

The first cte is to add a row identifier missing from your table - I've used ROW_NUMBER() OVER(ORDER BY Col1, Col2) for that.
The second cte is to get a unique set of values from combining both csv columns. Note that this does not handle the * part yet.
The third cte is handling the * issue.
And finally, the fourth cte is putting all the unique items back into a single csv. (I could do it in the third cte but I wanted to have each cte responsible of a single part of the solution - it's much more readable.)

Now all that's left is to update the first cte's Col3 with the fourth cte's Combined_Unique_Sorted:

;WITH cte1 as
(
    SELECT  Col1, 
            Col2, 
            Col3,
            ROW_NUMBER() OVER(ORDER BY Col1, Col2) As rn
    FROM dbo.csv_test
), cte2 as
(
    SELECT rn, Item
    FROM cte1
    CROSS APPLY 
    (
        SELECT DISTINCT Item 
        FROM dbo.DelimitedSplit8K(Col1 +','+ Col2, ',') 
    ) x
), cte3 AS
(
    SELECT rn, Item
    FROM cte2 t0
    WHERE NOT EXISTS
    (
        SELECT 1
        FROM cte2 t1
        WHERE t0.Item + '*' = t1.Item
        AND t0.rn = t1.rn
    )
), cte4 AS
(
    SELECT rn, 
           STUFF
           ((
             SELECT  ',' + Item
             FROM cte3 t1
             WHERE t1.rn = t0.rn
             ORDER BY Item
             FOR XML PATH('')
           ), 1, 1, '') Combined_Unique_Sorted
    FROM cte3 t0
)

UPDATE t0
SET Col3 = Combined_Unique_Sorted
FROM cte1 t0
INNER JOIN cte4 t1 ON t0.rn = t1.rn

To verify the results:

SELECT *
FROM csv_test
ORDER BY Col1, Col2

Results:

Col1            Col2                    Col3
qwe,bca,a23     qwe,bca,a23*,abc        a23*,abc,bca,qwe
xyz,def,abc     abc*,tuv,def,xyz*,abc   abc*,def,tuv,xyz*

You can see a live demo on rextester.

Sort two csv fields by removing duplicates and without row-by-row processing

1 Answers1