
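Tokens are separated by one or more whitespace characters, and each run of whitespace should itself be kept as a token. "A quoted string" is a single token. Any other run of characters that does not begin with a quote is a token. I tried the following, and it failed: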

var tokenre = /"[^"]*"|[^"]\S+|\s\s*/g;

For instance I want this input

[4,4]  "This is fun"
 2  2 +
 #

To tokenize as

['[4,4]', '  ', '"This is fun"', '\n ', '2', '  ', '2', ' ', '+', '\n ', '#']

This could be tested with the following code:

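// The sample input above, written as a string literal.
var program = '[4,4]  "This is fun"\n 2  2 +\n #';
var result = null;
do {
    result = tokenre.exec(program);
    console.log(result); // the final iteration logs null
} while (result !== null);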
jlettvin
  • While this is for Java, I think this question might be helpful to you: https://stackoverflow.com/questions/366202/regex-for-splitting-a-string-using-space-when-not-surrounded-by-single-or-double –  Jul 17 '18 at 17:58
  • Looks like you need `.match(/"[^"]*"|\S+|\s+/g)`, please check if it is in line with your requirements. – Wiktor Stribiżew Jul 17 '18 at 17:59
  • Your comment works. Thank you. I will say so in an answer. – jlettvin Jul 17 '18 at 19:50

1 Answer


It seems you want to split a string into runs of whitespace and runs of non-whitespace characters, while keeping each "..." substring between double quotes as a single element.

You may achieve it using

s.match(/"[^"]*"|\S+|\s+/g)


Details

  • "[^"]*" - a ", then zero or more characters other than a quote, and then a closing " (NOTE: to allow backslash escape sequences such as \" inside the quoted string, replace it with "[^"\\]*(?:\\[\s\S][^"\\]*)*")
  • | - or
  • \S+ - 1+ non-whitespace chars
  • | - or
  • \s+ - 1+ whitespace chars.
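
The escape-aware variant mentioned in the note can be tried with a small sketch like the one below. The names `tokenRe`, `tokenize`, and `sample` are made up for illustration and are not part of the question; the only change from the pattern above is the first alternative.

```javascript
// Escape-aware variant: a backslash inside quotes consumes the next character,
// so \" does not end the quoted token. (Names here are illustrative only.)
var tokenRe = /"[^"\\]*(?:\\[\s\S][^"\\]*)*"|\S+|\s+/g;

function tokenize(text) {
    // match() returns null when nothing matches (for example on empty input),
    // so fall back to an empty array.
    return text.match(tokenRe) || [];
}

// The string being tokenized is: say "a \"quoted\" word"  now
var sample = 'say "a \\"quoted\\" word"  now';
console.log(tokenize(sample));
```

With this pattern the whole `"a \"quoted\" word"` chunk should come back as one element, whereas the simpler `"[^"]*"` alternative would stop at the first escaped quote.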

JS demo:

var s = "[4,4]  \"This is fun\"\n2  2 +\n#";
console.log(s.match(/"[^"]*"|\S+|\s+/g));
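
For comparison with the `exec` loop in the question, here is a minimal sketch that runs the same pattern over the question's exact input, including the leading space on the second and third lines. The original attempt goes wrong because `[^"]\S+` needs at least two characters and its first character can be whitespace, so single-character tokens such as `2` and `#` never match on their own. The variable names follow the question; `tokens` is added here for illustration.

```javascript
// Pattern from the answer above.
var tokenre = /"[^"]*"|\S+|\s+/g;

// The question's sample input as a string literal.
var program = '[4,4]  "This is fun"\n 2  2 +\n #';

// Collect the matched text of each token with exec(), one match per iteration.
var tokens = [];
var result;
while ((result = tokenre.exec(program)) !== null) {
    tokens.push(result[0]); // result[0] is the full matched text
}

// JSON.stringify makes the whitespace tokens visible in the console.
console.log(JSON.stringify(tokens));
```

The resulting array should match the token list given in the question. Every alternative consumes at least one character, so the loop cannot stall on an empty match.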
Wiktor Stribiżew