0

Using SQL Server 2022, I'm trying to figure out how to use STRING_SPLIT in a set-based way to convert a table column where all rows have a variable number of key/value pairs into a table without duplicates, similar to T-SQL split string but not quite.

create table #mashed (mashedtext nvarchar(1600))

insert into #mashed 
values ('"id":"one", "code":"aaa", "dev":"yes"')
insert into #mashed 
values ('"id":"two", "code":"bbb", "dev":"yes"')
insert into #mashed 
values ('"id":"three", "code":"ccc", "dev":"no"')
insert into #mashed 
values ('"id":"three", "code":"bbb", "help":"no" , "rid":"6"')

Desired shape of output

   key    value
----------------
   id     one
   id     two
   id     three
   code   aaa
   code   bbb
   code   ccc
   dev    yes
   dev    no
   help   no
   rid    6

This is clearly wrong:

SELECT value 
FROM STRING_SPLIT (SELECT mashedtext FROM #mashed, ',') 

So what is the right way to do this without RBAR? I am sure there are two splits needed, once to get the pairs and again on each pair.

Thanks.

Dale K
  • 25,246
  • 15
  • 42
  • 71
Snowy
  • 5,942
  • 19
  • 65
  • 119
  • 1
    Do you have a comma between `"bbb"` and `help"` (`"code":"bbb" "help":"no"`)? The idea is to parse the text as JSON: `SELECT DISTINCT j.[key], j.[value] FROM #mashed m CROSS APPLY OPENJSON(CONCAT('{', m.mashedtext, '}')) j` – Zhorov Aug 01 '23 at 06:41
  • @Zhorov yes I was missing comma, now editing original q to include it, thanks. – Snowy Aug 01 '23 at 13:37

2 Answers2

2

The stored text data is almost a valid JSON object, so you may try to fix it and parse it with OPENJSON():

SELECT DISTINCT j.[key], j.[value]
FROM #mashed m
CROSS APPLY OPENJSON(CONCAT('{', m.mashedtext, '}')) j

If you want to use STRING_SPLIT() you need to use the third optional parameter in STRING_SPLIT() (introduced in SQL Server 2022) to get the ordinal position of the each substring:

SELECT DISTINCT s2.*
FROM #mashed m
CROSS APPLY STRING_SPLIT(m.mashedtext, ',') s1
CROSS APPLY (
   SELECT
      [key] = MAX (REPLACE(CASE WHEN [ordinal] = 1 THEN [value] END, '"', '')),
      [value] = MAX (REPLACE(CASE WHEN [ordinal] = 2 THEN [value] END, '"', ''))
  FROM STRING_SPLIT(TRIM(s1.[value]), ':', 1)
) s2

Result (without order):

key value
code aaa
code bbb
code ccc
dev no
dev yes
help no
id one
id three
id two
rid 6
Zhorov
  • 28,486
  • 6
  • 27
  • 52
  • Winning! Is there a smart way to index within the mashedtext column? On a table with 250k rows it takes 13 seconds, this table will have 5m rows at full at runtime. Even if it negatively impact insert performance, extraction speed will be critical. – Snowy Aug 01 '23 at 17:48
  • It turns out some of the keys in the real data set have colons in the key so they look like ....values ('"id":"three:six:nine", "code":"ccc", "dev":"no"')... and some values also have colons and commas, am struggling with converting the inner separators to slashes, what is best way to update that? – Snowy Aug 02 '23 at 02:05
  • 1
    @Snowy, I think the JSON-based approach solves this issue and returns `id` as `key` and `three:six:nine` as `value` (using test data `"id":"three:six:nine", "code":"ccc", "dev":"no"`). – Zhorov Aug 02 '23 at 05:52
1

STRING_SPLIT() to split into rows and then perform string parsing to extract the required key and value

SELECT m.mashedtext, v.[value], kv.[key], kv.[val]
FROM   #mashed m
       CROSS APPLY STRING_SPLIT(mashedtext, ',') s
       CROSS APPLY
       (
           -- remove leading spaces
           select [value] = ltrim(s.[value])
       ) v
       CROSS APPLY
       (     
           -- find position of colon
           select p = charindex(':', v.[value])
       ) p
       CROSS APPLY
       (
           -- get the required key and val
           select [key] = replace(left(v.value, p - 1), '"', ''),
                  [val] = replace(right(v.value, len(v.value) - p), '"', '')
       ) kv
Squirrel
  • 23,507
  • 4
  • 34
  • 32