Oracle How do I transform this string field into structured data using regular expressions?

Question

I did start at this answer: Oracle 11g get all matched occurrences by a regular expression

But it didn't get me far enough. I have a string field that looks like this:

A=&token1&token2&token3,B=&token2&token3&token5

This is proving rather difficult to do.

I started here:

SELECT token from 
  (SELECT REGEXP_SUBSTR(str, '[A-Z=&]+', 1, LEVEL) AS token
    FROM (SELECT 'A=&token1&token2&token3,B=&token2&token3&token5' str from dual)
  CONNECT BY LEVEL <= LENGTH(REGEXP_REPLACE(str, '[A-Z=&]+', ',')))
  Where token is not null

But that yields:

A=&
&
&
B=&
&
&

which is getting me nowhere. I'm thinking I need to do a nested clever select where the first one gets me

A=&token1&token2&token3
B=&token2&token3&token5

And a subsequent select might be able to do a clever extract to get the final result.

Stumped. I'm trying to do this without using procedural or function code -- I would like the set to be something I can union with other queries so if it's possible to do this with nested selects that would be great.

UPDATE:

SET DEFINE OFF
SELECT SUBSTR(token,1,1) as Key, REGEXP_SUBSTR(token, '&\w+', 1, LEVEL) AS token2
FROM
(
-- 1 row per key/value pair
  SELECT token from 
  (SELECT REGEXP_SUBSTR(str, '[^,]+', 1, LEVEL) AS token
    FROM (SELECT 'A=&token1&token2&token3,B=&token2&token3&token5' str from dual)
  CONNECT BY LEVEL <= LENGTH(REGEXP_REPLACE(str, '[^,]+', ',')))
  Where token is not null
)
CONNECT BY LEVEL <= LENGTH(REGEXP_REPLACE(token, '&\w+'))

This gets me

Which is fantastic formatting except for the small problem that it's wrong (A should have a token3, and token4 and token5 are nowhere to be seen).

Captain · Accepted Answer · 2014-08-27T06:55:48.860

0

Great question! Thanks for it!

select distinct k, regexp_substr(v, '[^&]+', 1, level) t
from (    
  select substr(regexp_substr(val,'^[^=]+=&'),1,length(regexp_substr(val,'^[^=]+=&'))-2) k, substr(regexp_substr(val,'=&.*'),3) v
  from (
    select regexp_substr(str, '[^,]+', 1, level) val
    from (select 'A=&token1&token2&token3,B=&token2&token3&token5' str from dual)
    connect by level <= length(str) - length(replace(str,','))+1
  )
) connect by level <= length(v) - length(replace(v,'&'))+1

It is an answer, and one that seems to work... But I don't like the middle splitting the val into kand v- there must be a better way (if the Key is always one character, that makes it easy though) . And having to put a DISTINCT to get rid of duplicates is horrible... Maybe with further playing you can clean it up though (or someone else might)

EDIT based on keeping the leading & and the key being a single character:

select distinct k, regexp_substr(v, '&[^&]+', 1, level) t
from (    
  select substr(val,1,1) k 
  , substr(regexp_substr(val,'=&.*'),1) v
  from (
    select regexp_substr(str, '[^,]+', 1, level) val
    from (select 'A=&token1&token2&token3,B=&token2&token3&token5' str from dual)
    connect by level <= length(str) - length(replace(str,','))+1
  )
) connect by level < length(v) - length(replace(v,'&'))+1

edited Aug 27 '14 at 06:55

answered Aug 26 '14 at 07:09

Captain

2,148
15
13

Fantastic! I was able to ultimately get it working only by doing some things like getting rid of nulls and using distinct as well. I like yours better, though I need to analyze it a bit. I'm not used to hierarchical queries in Oracle – Randy Magruder Aug 26 '14 at 20:13
Just one thing - my "&" is missing from the output. – Randy Magruder Aug 26 '14 at 20:21
BY THE WAY, "kEY" is ALWAYS one character – Randy Magruder Aug 26 '14 at 21:00
Why wouldn't I use "CONNECT BY LEVEL <= REGEXP_COUNT( str, expr )" instead of this length-replace approach? I'm finding that performance nosedives when level >= 7. Wouldn't I want it to just be the # of tokens that I have found in the string? Usually that's no more than a few...Maybe I'm missing something... – Randy Magruder Oct 23 '14 at 19:22
@RandyMagruder Usually string length and replacement functions are quite performant - they have been in the systems for so long. Regexp would be a fine way to do it, if it works better. You are correct that you just need the number of keys/tokens. It could be interesting to see what is there without the `DISTINCT` with many levels - it could be that it is doing a lot of redundant work (e.g. growing exponentially) – Captain Oct 23 '14 at 19:57
Yeah, I don't think it is the length/replace functions that are slow, but I'm boggled that it seems that this results in too many levels which exponentially runs the query time into infinity. REGEXP_COUNT seems to limit the LEVEL more nicely and returns almost instantly. I don't honestly understand the purpose of the LENGTH/REPLACE combination. – Randy Magruder Oct 31 '14 at 14:52
Question: how many times does the length(v) - length(replace(v,'&'))+1 get called? Does the replace modify the original string? I would have thought it just returned the replaced string, and that this calculation would only be performed ONCE to determine the maximum # of levels to connect? – Randy Magruder Oct 31 '14 at 15:14

Oracle How do I transform this string field into structured data using regular expressions?

1 Answers1