Tokenize a string in Oracle preserving literals

Question

I am starting with data like this:

John went "to the store"

I'd like to tokenize this into

John
went
to the store

I am trying to use the REGEXP_SUBSTR method to do this, but can't seem to come up with a regular expression that says 'match non-spaces that are not between double-quotes'

SELECT DISTINCT
    REGEXP_SUBSTR ('John went "to the store"', '[^[:space:]]+', 1, level) AS word
FROM
    dual
    CONNECT BY level <= LENGTH(regexp_replace('John went "to the store"','[^[:space:]]+'))
    +1
ORDER BY word ASC;

My brain is a little regexp fried at this point. Thanks in advance!

Is your delimiter a single space or could it be multiple spaces between words? In other words if 2 spaces are encountered, is that really a NULL list element between them? — Gary_W, Mar 18 '16 at 17:28
Could there be more than one double-quoted string in the main string? Please add to your original post a sample of actual strings you have to deal with. — Gary_W, Mar 18 '16 at 17:36
Two spaces should be treated as one. And yes, there could be more than one double quoted string within the data. This is to parse a search string. — Aaron, Mar 18 '16 at 18:28
@AlexPoole - That is very close to what I was looking for. I used the top solution with regexp_count in the connect by to remove the NULL rows, and then added a TRIM to get rid of the double quotes. If you answer I will accept. — Aaron, Mar 18 '16 at 18:42
@Aaron - if it is that similar then the question is a duplicate *8-) — Alex Poole, Mar 18 '16 at 18:48
Well just in case anyone else needs the answer that worked for me - `SELECT DISTINCT TRIM('"' FROM REGEXP_SUBSTR (v_words, '("[^"]*"|[^[:space:]]+)', 1, level)) AS word FROM dual CONNECT BY level <= regexp_count(v_words,'("[^"]*"|[^[:space:]]+)')` — Aaron, Mar 18 '16 at 19:20

Tokenize a string in Oracle preserving literals

0 Answers0