0

I am starting with data like this:

  • John went "to the store"

I'd like to tokenize this into

  • John
  • went
  • to the store

I am trying to use the REGEXP_SUBSTR method to do this, but can't seem to come up with a regular expression that says 'match non-spaces that are not between double-quotes'

SELECT DISTINCT
    REGEXP_SUBSTR ('John went "to the store"', '[^[:space:]]+', 1, level) AS word
FROM
    dual
    CONNECT BY level <= LENGTH(regexp_replace('John went "to the store"','[^[:space:]]+'))
    +1
ORDER BY word ASC;

My brain is a little regexp fried at this point. Thanks in advance!

Aaron
  • 414
  • 4
  • 16
  • 1
    Is your delimiter a single space or could it be multiple spaces between words? In other words if 2 spaces are encountered, is that really a NULL list element between them? – Gary_W Mar 18 '16 at 17:28
  • Could there be more than one double-quoted string in the main string? Please add to your original post a sample of actual strings you have to deal with. – Gary_W Mar 18 '16 at 17:36
  • Two spaces should be treated as one. And yes, there could be more than one double quoted string within the data. This is to parse a search string. – Aaron Mar 18 '16 at 18:28
  • @AlexPoole - That is very close to what I was looking for. I used the top solution with regexp_count in the connect by to remove the NULL rows, and then added a TRIM to get rid of the double quotes. If you answer I will accept. – Aaron Mar 18 '16 at 18:42
  • @Aaron - if it is that similar then the question is a duplicate *8-) – Alex Poole Mar 18 '16 at 18:48
  • Well just in case anyone else needs the answer that worked for me - `SELECT DISTINCT TRIM('"' FROM REGEXP_SUBSTR (v_words, '("[^"]*"|[^[:space:]]+)', 1, level)) AS word FROM dual CONNECT BY level <= regexp_count(v_words,'("[^"]*"|[^[:space:]]+)')` – Aaron Mar 18 '16 at 19:20

0 Answers0