Join by concatenated field (in CSV format)

Question

UPDATE: I am fully aware that this is a poor RDBMS practice, but the question is not asking whether it is and how I can re-train the DBAs who created this architecture. The question is how I can work around the situation that I have on hands. I appreciate the help of the community and must admit that this is an interesting problem indeed.

In SQL Server 2017, I have a lookup table containing codes and a transactions table with CSV-formated codes:

CREATE TABLE #t(cd VARCHAR(100))
CREATE TABLE #cd (id INT, cd VARCHAR (1000))

INSERT INTO #t SELECT 'c1'
INSERT INTO #t SELECT 'c1,c2'
INSERT INTO #t SELECT 'c1,c2,c3'

INSERT INTO #cd SELECT 10, 'c1'
INSERT INTO #cd SELECT 20, 'c2'
INSERT INTO #cd SELECT 30, 'c3'

So, the lookup is

id  cd
10  c1
10  c1
20  c2
30  c3

and, the transactions table has:

cd
c1
c1,c2
c1,c2,c3

I need to replace the codes to their respective IDs, while keeping these in CSV format.

I would like to avoid the cursor because it is too slow. Is there a way to parse the codes, do the JOIN, and recombine the IDs somehow efficiently? I suppose COALESCE may be of use, but need help applying it. Perhaps, there is already a function in t-SQL that does this types of lookups.

The output needs to another column in transactions table:

id
10
10,20
10,20,30

It's **extremely** poor practice to put delimited data into a single column. Fix your schema, and this suddenly becomes an easy problem. — Joel Coehoorn, Apr 04 '18 at 19:29
This must be a really common homework assignment. This is at least the third question today with this really strange request. — Sean Lange, Apr 04 '18 at 19:31
@nicomp and IP address. The delimiters in those fields are ways of formatting a single value. — Joel Coehoorn, Apr 04 '18 at 19:36
@Joel: I understand this is a poor practice, but teaching me a lesson is not helpful in solving the problem. I am not the developer of this table and I have no control over changing it. However, I must use it in my ML projects and need community's help. Please refrain form judging my question unfairly. — Oleg Melnikov, Apr 04 '18 at 19:36
@Sean, Joel, these are neither the HW assignments nor the Zip codes. I have the real and messy data. I simplified the question a lot. Please do not rush into grading me for the question. Instead, kindly help out. Currently I have a solution in Python, but is there an easy way to do this in SQL? This is, in fact, a very interesting question. Joel, yes, I've seen phone numbers stored this way before, but these are technical codes in my case. Their meaning is irrelevant. — Oleg Melnikov, Apr 04 '18 at 19:40
@JoelCoehoorn Not really. A phone number is an area code, prefix, and line number. — nicomp, Apr 04 '18 at 19:44
It just strikes me as funny that today as many as three people are trying to join delimited data like this. The data itself is irrelevant, the issue is the data structure violates 1NF with multiple values in a single tuple. — Sean Lange, Apr 04 '18 at 19:44
@Sean. LOL. Should I delete the question and ask about it tomorrow? Would that break the trend? This question has been bugging me for several months. Surely I can wait a week, if that attracts a positive spirit :) — Oleg Melnikov, Apr 04 '18 at 19:47
Here it is. https://stackoverflow.com/questions/49652929/multiple-tables-in-the-where-clause-sql This is pretty much exactly the same thing you are trying to do. The join is the part you could leverage. Or you could use a string splitter to turn this into something more usable. Here is my first choice of splitters. http://www.sqlservercentral.com/articles/Tally+Table/72993/ Several other excellent options here. https://sqlperformance.com/2012/07/t-sql-queries/split-strings — Sean Lange, Apr 04 '18 at 19:59
@Sean. That's interesting, but is a different problem (needing the substring lookups). In my case, I need the joins. Actually, I can use CROSS APPLY to split the data, then JOIN, then COALESCE. That's it. https://stackoverflow.com/questions/5493510/turning-a-comma-separated-string-into-individual-rows — Oleg Melnikov, Apr 04 '18 at 20:05

DhruvJoshi · Accepted Answer · 2018-04-04T21:14:17.780

You can first strip out comma into a list and then join and get correct ids for codes and then add them back with commas. I used a row_number upfront to get a unique thing to join back on in my query.

See live demo

CREATE TABLE #t(cd VARCHAR(100))
CREATE TABLE #cd (id INT, cd VARCHAR (1000))

INSERT INTO #t SELECT 'c1'
INSERT INTO #t SELECT 'c1,c2'
INSERT INTO #t SELECT 'c1,c2,c3'

INSERT INTO #cd SELECT 10, 'c1'
INSERT INTO #cd SELECT 20, 'c2'
INSERT INTO #cd SELECT 30, 'c3'


; WITH X AS 
(
    SELECT 
     C.id,P1.rn
    FROM
     (
     SELECT *, row_number() over( order by (select 1)) rn,
     cast('<X>'+replace(P.cd,',','</X><X>')+'</X>' as XML) AS xmlitems FROM #t P
     )P1
     CROSS APPLY
     ( 
     SELECT fdata.D.value('.','varchar(100)') AS splitdata 
     FROM P1.xmlitems.nodes('X') AS fdata(D)) O
     LEFT JOIN #cd C
     ON C.cd=  LTRIM(RTRIM(O.splitdata ))
    ) 

SELECT 
    rn,
    id= STUFF((
  SELECT ',' + cast(id as varchar(100)) FROM X AS x2 
  WHERE x2.rn = x.rn
  ORDER BY rn FOR XML PATH, 
  TYPE).value(N'.[1]',N'varchar(max)'), 1, 1, '')
  FROM 
  X
GROUP BY rn

Note: With SQL server 2017 you can also you SPLIT_STRING() function and STRING_AGG() functions

SQL SERVER 2017 code:

select 
 id=STRING_AGG(id,',')
 from
(
  select V=value, rn
  from
      (
        select 
             rn=row_number() over( order by (select 1)), 
             cd 
        from #T
      )T
   cross apply STRING_SPLIT(cd, ',') 
 )  T
left join #cd C
     on cd= v
group by rn

Super! just what I needed. – Oleg Melnikov Apr 04 '18 at 22:02 — Oleg Melnikov, Apr 04 '18 at 22:02

Join by concatenated field (in CSV format)

1 Answers1