30

I currently have this regular expression to split strings by all whitespace, unless it's in a quoted segment:

keywords = 'pop rock "hard rock"';
keywords = keywords.match(/\w+|"[^"]+"/g);
console.log(keywords); // [pop, rock, "hard rock"]

However, I also want it to be possible to have quotes in keywords, like this:

keywords = 'pop rock "hard rock" "\"dream\" pop"';

This should return

[pop, rock, "hard rock", "\"dream\" pop"]

What's the easiest way to achieve this?

Blaise

4 Answers

34

You can change your regex to:

keywords = keywords.match(/\w+|"(?:\\"|[^"])+"/g);

Instead of [^"]+ you've got (?:\\"|[^"])+, which allows either an escaped quote (\") or any character that is not a quote, so only an unescaped quote can end the quoted segment.

One important note: if you want the string to contain a literal backslash, the backslash itself has to be escaped in the JavaScript string literal:

keywords = 'pop rock "hard rock" "\\"dream\\" pop"'; // note the escaped backslashes

Also, there's a slight inconsistency between \w+ and [^"]+: for example, the pattern matches "ab*d" as a single keyword when it is quoted, but splits ab*d (without quotes) into ab and d. Consider using [^"\s]+ instead, which matches any run of characters that are neither quotes nor whitespace.
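As a quick check, here is a minimal sketch that applies the pattern, with the suggested [^"\s]+ swapped in for \w+, to the example string. The variable names are illustrative only.

```javascript
// Backslashes are doubled so the string really contains \" sequences.
const input = 'pop rock "hard rock" "\\"dream\\" pop"';

// Either a bare keyword (no quotes or whitespace),
// or a quoted segment in which \" does not end the segment.
const pattern = /[^"\s]+|"(?:\\"|[^"])+"/g;

const result = input.match(pattern);
console.log(result);
// Four matches: pop, rock, "hard rock" and "\"dream\" pop"
// (the quotes and backslashes stay in the matched text).
```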

Kobi
  • I suggest you use `\\.` instead of `\\"` because backslashes can be escaped too, and you wouldn't want to miss `"foo\\\\"`. – Tim Pietzcker Oct 27 '10 at 14:45
  • @Tim - interesting idea at first, but I'm not sure it's necessary - wouldn't `[^"]` handle these cases? Am I missing something? – Kobi Oct 27 '10 at 15:24
  • Consider this: In the string `"\\" "foo"` (just two backslashes for clarity), the first `"` would be matched by the literal `"` at the start of the regex. Then the `[^"]` would match the first \. Then the remaining `\"` would be matched by `\\"` (because it comes first in the alternation). Then `[^"]` would match the space and the `"` (at the end of the regex) would match the opening quote of `"foo"`, disrupting the parsing. – Tim Pietzcker Oct 27 '10 at 16:06
  • It works just like it should. "(?:\\"|[^"])+ which should be self-explanatory" < not really ;-), I had never used this in regexps before; a colleague had to explain it to me. "Consider using [^"\s]+ instead" < This is something I had already adjusted. Thanks for your help! – Blaise Oct 27 '10 at 16:11
  • @Tim - That would be the same for `"A\" "foo"` - the second quote is escaped, which fits the requirements here, but I get your point - it's a good idea not to allow lonely slashes, something like `\w+|"(?:\\.|[^"\\])+"`. – Kobi Oct 27 '10 at 16:30
  • @Blaise - no problem. It's a small leap from `[^"]` to `\\"|[^"]`, but then you need a group, and you might as well make it a non-capturing group... I guess I took too many steps at once `:P` – Kobi Oct 27 '10 at 16:33
  • @Kobi: It would not be the same for `"A\" "foo"` - in `"\\" "foo"` the second quote is *not* escaped. – Tim Pietzcker Oct 27 '10 at 17:02
  • Can someone please explain the self-explanatory bit? – OldPeculier Oct 15 '14 at 20:07
  • Very nice. It seems that one problem remains: `\"a b c"` is considered to be within quotes, because the leading escape is ignored. – Timo Jun 17 '16 at 09:33
  • @Timo - That is correct. I think for the purpose of this question `\"` is only considered valid within quotes, like in programming languages. – Kobi Jun 17 '16 at 09:59
  • @Kobi Fair point. For whom it concerns, I have prepended `(?<!\\)(?:\\\\)*` to the regex. That is, *not* preceded by a backslash, and then there must be an even number of backslashes (i.e. escaped backslashes). In other words, the opening quote must be preceded by 0, 2, 4, 6, ... backslashes, or else (i.e. 1, 3, ... backslashes) we will not consider it to be an opening quote. – Timo Jun 17 '16 at 13:16
  • @Timo: unfortunately this approach is wrong, since the lookbehind doesn't exist in javascript, and even if it is possible, it isn't efficient at all. See my answer for example. – Casimir et Hippolyte Oct 19 '16 at 00:38
  • @CasimiretHippolyte Right, I forgot. It is inefficient because the lookbehind will walk back to the start of the text to count the backslashes? It does not think to execute this part as it walks forward through the text initially? – Timo Oct 19 '16 at 08:42
  • In JavaScript, if you don't want the quotes too, you can just use `keywords = keywords.match(/\w+|"(?:\\"|[^"])+"/g).map((a) -> if a.match(/".+"/g) then a.slice(1, -1) else a)`. Not pure regex, but still doesn't require a feature JavaScript still doesn't have: regex look-behinds. – wallabra Aug 29 '17 at 19:57
  • @Gustavo6046 - You probably meant `? :`. I don't think JavaScript has `then`. – Kobi Aug 29 '17 at 20:25
  • Oh yes. I forgot to _decaffeinate_ the code :P Thank you. – wallabra Aug 29 '17 at 21:25
  • An explanation of how this regex works, like @CasimiretHippolyte does in their answer, would be nice. It's also worth noting that this seems to split the string by any non-alpha-numeric characters, eg. `C.S.` -> `["C", "S"]`. – V. Rubinetti Sep 18 '20 at 15:56
  • This doesn't support a quote preceded by an escaped escape character. – DGoiko Dec 23 '20 at 21:51
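The escaped-backslash case discussed in these comments can be reproduced with a short sketch. It compares the pattern from the answer with the `\\.` variant Kobi gives in the comments, on Tim Pietzcker's input `"\\" "foo"`. The variable names are illustrative only.

```javascript
// The actual text is: "\\" "foo"  (two backslashes inside the first pair of quotes)
const edgeInput = '"\\\\" "foo"';

// Pattern from the answer: only \" is treated as an escape.
const quoteOnly = /\w+|"(?:\\"|[^"])+"/g;
// Variant from the comments: a backslash escapes whatever character follows it.
const anyEscape = /\w+|"(?:\\.|[^"\\])+"/g;

const withQuoteOnly = edgeInput.match(quoteOnly);
const withAnyEscape = edgeInput.match(anyEscape);

console.log(withQuoteOnly); // two matches: "\\" "  and  foo  (parsing is disrupted)
console.log(withAnyEscape); // two matches: "\\"  and  "foo"
```

With the first pattern the second backslash pairs up with the closing quote, so the match runs on into the opening quote of "foo". The second pattern consumes the two backslashes as one escaped pair and closes the segment where expected.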
9

ES6 solution supporting:

  • Splits on spaces, except inside quotes
  • Removes the quotes, except for backslash-escaped quotes
  • Escaped quotes become literal quotes
  • Quotes can appear anywhere in a keyword

Code:

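// Tokenize into single characters, keeping a backslash together with the character it escapes.
keywords.match(/\\?.|^$/g).reduce((p, c) => {
    if (c === '"') {
        p.quote ^= 1; // an unescaped quote toggles the "inside quotes" flag
    } else if (!p.quote && c === ' ') {
        p.a.push(''); // a space outside quotes starts a new keyword
    } else {
        p.a[p.a.length - 1] += c.replace(/\\(.)/, "$1"); // drop the escaping backslash
    }
    return p;
}, {a: ['']}).a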

Output:

[ 'pop', 'rock', 'hard rock', '"dream" pop' ]
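For reuse, the same reduce can be wrapped in a function. This is only a sketch: the name splitKeywords and the sample inputs are hypothetical, not part of the answer.

```javascript
// Hypothetical wrapper around the reduce-based splitter shown above.
function splitKeywords(input) {
  return input.match(/\\?.|^$/g).reduce((p, c) => {
    if (c === '"') {
      p.quote ^= 1;                 // toggle the "inside quotes" flag
    } else if (!p.quote && c === ' ') {
      p.a.push('');                 // start a new keyword
    } else {
      p.a[p.a.length - 1] += c.replace(/\\(.)/, '$1'); // unescape \x to x
    }
    return p;
  }, { a: [''] }).a;
}

const plain = splitKeywords('pop rock "hard rock" "\\"dream\\" pop"');
const midWord = splitKeywords('foo"bar baz"qux'); // quotes in the middle of a keyword
const doubled = splitKeywords('a  b');            // two consecutive spaces

console.log(plain);   // [ 'pop', 'rock', 'hard rock', '"dream" pop' ]
console.log(midWord); // [ 'foobar bazqux' ]
console.log(doubled); // [ 'a', '', 'b' ]
```

Note that consecutive spaces outside quotes produce empty strings; if that is unwanted, one option is to filter the result afterwards, for example with .filter(Boolean).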
Tsuneo Yoshioka
4

Kobi's answer works well for the example string, but it fails when there is more than one successive escape character (backslash) between quotes, as Tim Pietzcker pointed out in the comments. To handle these cases, the pattern can be written like this (for the match method):

(?=\S)[^"\s]*(?:"[^\\"]*(?:\\[\s\S][^\\"]*)*"[^"\s]*)*

Here (?=\S) ensures there is at least one non-whitespace character at the current position. This is needed because the rest of the pattern, which describes all allowed substrings (including whitespace between quotes), is entirely optional.

Details:

(?=\S)   # followed by a non-whitespace
[^"\s]*  #"# zero or more characters that aren't a quote or a whitespace
(?: # when a quoted substring occurs:
    "       #"# opening quote
    [^\\"]* #"# zero or more characters that aren't a quote or a backslash
    (?: # when a backslash is encountered:
        \\ [\s\S] # an escaped character (including a quote or a backslash)
        [^\\"]* #"#
    )*
    "         #"# closing quote
    [^"\s]*   #"#
)*
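A minimal sketch of how this pattern might be used with match, written on one line because JavaScript regex literals have no free-spacing mode. The variable names and the second test string are illustrative only.

```javascript
// The pattern from above, collapsed onto one line, with the global flag.
const keywordPattern = /(?=\S)[^"\s]*(?:"[^\\"]*(?:\\[\s\S][^\\"]*)*"[^"\s]*)*/g;

// Actual text: pop rock "hard rock" "\"dream\" pop"
const sample = 'pop rock "hard rock" "\\"dream\\" pop"';
// Actual text: "\\" "foo"  (an escaped backslash right before the closing quote)
const tricky = '"\\\\" "foo"';

const sampleMatches = sample.match(keywordPattern);
const trickyMatches = tricky.match(keywordPattern);

console.log(sampleMatches); // pop, rock, "hard rock", "\"dream\" pop"
console.log(trickyMatches); // "\\" and "foo"
```

The quotes and backslashes are kept in the matches, so they still need to be stripped or unescaped afterwards if only the keyword text is wanted.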
Casimir et Hippolyte
0

I would like to point out I had the same regex as you,

/\w+|"[^"]+"/g

but it didn't work on empty quoted strings such as:

"" "hello" "" "hi"

so I had to replace the + quantifier with *, which gave me:

str.match(/\w+|"[^"]*"/g);

Which is fine.

(ex: https://regex101.com/r/wm5puK/1)
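To make the difference concrete, here is a small sketch that runs both quantifiers on the input above. The variable names are illustrative only.

```javascript
const quoted = '"" "hello" "" "hi"';

// With +, an empty "" cannot match, so the regex pairs up the wrong quotes.
const withPlus = quoted.match(/\w+|"[^"]+"/g);
// With *, an empty quoted string is a valid match.
const withStar = quoted.match(/\w+|"[^"]*"/g);

console.log(withPlus); // [ '" "', 'hello', '" "', '" "', 'hi' ]
console.log(withStar); // [ '""', '"hello"', '""', '"hi"' ]
```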

neolectron