You would think this one has been asked before, but I can't find it.

I need to split a JavaScript string on unquoted commas. I'm only using double quotes, so that should make it a bit simpler.

I have tried two approaches but not nailed it.

I need to turn this:

'body.loaded"who, are , you" div"hello ,"#div-id span CODE, body.loaded span"span, text" code'

into this:

[
 'body.loaded"who, are , you" div"hello ,"#div-id span CODE',
 'body.loaded span"span, text" code'
]

1) Match the good parts, which mostly works but gives me a lot of empty strings in my result.

'body.loaded"who, are , you" div"hello ,"#div-id span CODE, body.loaded span"span, text" code'.match(
  /([^,]*"[^"]*")*/g
)

['body.loaded"who, are , you" div"hello' ,'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ' body.loaded span"span, text"', '', '', '', '', '', '']

I think it's because of the () in the regex.

2) Split on the bad parts, which isn't quite there yet. The idea here is to match commas followed by an even number of ".

'body.loaded"who, are , you" div"hello ,"#div-id span CODE, body.loaded span"span, text" code'.split(
    /,(?![^"]*"[^"]*("[^"]*"[^"]*)*$)/
);

Basically, there has to be a cleaner, simpler and more beautiful solution (bear in mind that JavaScript does not support lookbehinds).

Roderick Obrist

1 Answer

Assuming you don't support escapes in your double-quoted strings, this should probably work:

/(?:"[^"]*"|[^,])+/g
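
As a rough sketch of how this could be applied to the string from the question: the names `input` and `parts` are only illustrative, and the `trim()` call is an assumption about the desired result, since the match itself keeps the space that follows each separating comma. The `|| []` guards against `match` returning `null` when nothing matches.

```javascript
// The input string from the question.
const input = 'body.loaded"who, are , you" div"hello ,"#div-id span CODE, body.loaded span"span, text" code';

// Each match is a run of double-quoted strings and non-comma characters.
// trim() is an added assumption: it drops the space left after each comma.
const parts = (input.match(/(?:"[^"]*"|[^,])+/g) || []).map((s) => s.trim());

console.log(parts);
// [ 'body.loaded"who, are , you" div"hello ,"#div-id span CODE',
//   'body.loaded span"span, text" code' ]
```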

If you do want to support backslash-escapes inside of double-quoted strings, this should do the job:

/(?:"(?:\\.|[^"])*"|[^,])+/g
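
To illustrate the difference, here is a small sketch comparing the two patterns on a made-up input that contains escaped quotes inside a quoted string. The input and the variable names are assumptions for illustration only.

```javascript
// String.raw keeps the backslashes literal, so the quoted part is "he said \"hi, there\"".
const escapedInput = String.raw`say"he said \"hi, there\"" now, next`;

// The first pattern ends the quoted string at the escaped quote and splits at the inner comma.
const withoutEscapes = escapedInput.match(/(?:"[^"]*"|[^,])+/g);

// The second pattern consumes \" as an escape, so the inner comma stays protected.
const withEscapes = escapedInput.match(/(?:"(?:\\.|[^"])*"|[^,])+/g);

console.log(withoutEscapes.length); // 3
console.log(withEscapes.length);    // 2
```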

If you want to support backslash-escapes outside of double-quoted strings too (e.g. escaping the initial quote), then try this:

/(?:"(?:\\.|[^"])*"|\\.|[^,])+/g
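
A quick sketch of where the extra alternative matters, using a made-up input in which an escaped quote appears outside of any quoted string and an unrelated quote appears later. The input and names are illustrative assumptions.

```javascript
// The backslash-escaped quote sits outside of any double-quoted string.
const outsideInput = String.raw`a\"b, c"d`;

// Second pattern: the escaped quote is taken as the start of a quoted string,
// which runs up to the later quote and swallows the comma.
const second = outsideInput.match(/(?:"(?:\\.|[^"])*"|[^,])+/g);

// Third pattern: \" is consumed as an escape, so the comma still separates.
const third = outsideInput.match(/(?:"(?:\\.|[^"])*"|\\.|[^,])+/g);

console.log(second.length); // 1
console.log(third.length);  // 2
```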

Here's an explanation for how the third pattern works.

First, an expanded, annotated version:

(?:      # start a non-capturing group
  "      # Match a double quote
  (?:    # Another non-capturing group, for the contents of the double-quote
    \\.  # Match any backslash-escaped character
  | [^"] # or any non-double-quote character
  )*     # End the group. Repeat zero or more times
  "      # Close double quote
|        # Alternative to double-quoted string
  \\.    # Match any escaped character
|        # Another alternative
  [^,]   # Match any non-comma character
)+       # Close group, repeat one or more times
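
JavaScript regex literals have no free-spacing mode, so the annotated layout above can't be pasted in directly. One possible way to keep the pieces readable is to build the pattern from named fragments; the fragment names below are hypothetical, and this is only a sketch of the idea.

```javascript
// String.raw keeps the backslashes literal, so each fragment reads like regex source.
const quotedString = String.raw`"(?:\\.|[^"])*"`; // a double-quoted string, escapes allowed
const escapedChar = String.raw`\\.`;              // any backslash-escaped character
const nonComma = String.raw`[^,]`;                // any single non-comma character

// Quoted strings are tried first, then escapes, then plain non-comma characters.
const splitter = new RegExp(`(?:${quotedString}|${escapedChar}|${nonComma})+`, 'g');

console.log('x"a, b" y, z'.match(splitter)); // [ 'x"a, b" y', ' z' ]
```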

There are three primary components here.

The first is to match any double-quoted string. This comes first in the group because if a double-quoted string can possibly match here, it should, as opposed to using the non-comma rule. Inside this double-quoted string we can match either any escaped character (\\.), which lets us escape double-quotes inside the string, or we match any non-double-quote character. We only match one character at a time so as to not catch escapes with the non-double-quote character rule. The contents of the string use * because double-quoted strings may be empty, and then we terminate the string.

Instead of a double-quoted string, we may just match any escaped character (\\.). This lets us escape a double-quote character while outside of a double-quoted string. It actually lets us escape a comma too, which I'm not sure if you want. If you don't want it, this rule should turn into \\[^,] instead.
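
To make the escaped-comma behaviour concrete, here is a sketch comparing the two variants on a made-up input; the input and the variable names are assumptions for illustration.

```javascript
// A backslash directly in front of the first comma.
const commaInput = String.raw`a\, b, c`;

// With \\. the backslash escapes the comma, so it does not separate.
const escapeAnything = commaInput.match(/(?:"(?:\\.|[^"])*"|\\.|[^,])+/g);

// With \\[^,] the backslash cannot escape a comma, so the comma still separates.
const escapeNonComma = commaInput.match(/(?:"(?:\\.|[^"])*"|\\[^,]|[^,])+/g);

console.log(escapeAnything.length); // 2
console.log(escapeNonComma.length); // 3
```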

And lastly, if we can't match a double-quoted string here, and we can't match an escape, just match any non-comma character. This is not repeated so as to not catch later double-quotes or escapes with this rule.

Then we go ahead and repeat the entire pattern with the + modifier. This lets us match more than one token at a time. We use + instead of * to avoid returning empty strings in our result.
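
The effect of + versus * can be seen with a tiny sketch (the sample string and names are just for illustration): with *, the pattern also succeeds with a zero-length match at each comma and at the end of the string.

```javascript
const sample = 'a,b';

// * allows zero-length matches, which show up as empty strings.
const withStar = sample.match(/(?:"[^"]*"|[^,])*/g);

// + requires at least one character per match.
const withPlus = sample.match(/(?:"[^"]*"|[^,])+/g);

console.log(withStar); // [ 'a', '', 'b', '' ]
console.log(withPlus); // [ 'a', 'b' ]
```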

Lily Ballard
  • Wow, you're a straight-up ninja. I spent like 3 hours on that thing and you did it in 10 seconds. Could you explain the logic behind this solution though, just so I don't have to bother any more ninjas again? – Roderick Obrist Mar 07 '12 at 22:24
  • @RoderickObrist: I'll post an explanation of the longest one inside of the answer in a minute. – Lily Ballard Mar 07 '12 at 22:26
  • @RoderickObrist: I hope my extended explanation is suitable. If anything is unclear, feel free to ask more questions. – Lily Ballard Mar 07 '12 at 22:35