Regex (JS Notation): Select spaces not in [ [], {}, "" ] to tokenize string

Question

I need to tokenize a string on every space that is not between quotes, using a regex in JavaScript notation.

For example:

" Test Test " ab c " Test" "Test " "Test" "T e s t"

becomes

[" Test Test ",ab,c," Test","Test ","Test","T e s t"]

For my use case however, the solution should work in the following test setting: https://www.regextester.com/

All spaces not within quotes should be highlighted in the above setting. A regex that highlights them there will also parse correctly in my program.

More specifically, I am using Boost.Regex in C++ to do the parsing as follows:

...
std::string test_string("\" Test Test \" ab c \" Test\" \"Test \" \"Test\" \"T e s t\"");
// (,|;\\s|\\s)+  : Split on runs of ',', ';' + whitespace, or whitespace
// (?![^\\[]*\\]) : Ignore spaces inside []
// (?![^\\(]*\\)) : Ignore spaces inside ()
// (?![^\\{]*\\}) : Ignore spaces inside {}
// (?![^\"].*\")  : Ignore spaces inside "" !!! MY ATTEMPT DOESN'T WORK !!!

//Note the below regex delimiter declaration does not include the erroneous regex.
boost::regex delimiter("(,|;\\s|\\s)+(?![^\\[]*\\])(?![^\\(]*\\))(?![^\\{]*\\})");
std::vector<std::string> string_vector;
boost::split_regex(string_vector, test_string, delimiter);

For those of you who do not use Boost.Regex or C++, the above link should let you test candidate regexes for this use case.

Thank you all for your assistance. I hope you can help me with the above problem.

It's possible this may not be a great place to use regular expressions. They're a fantastic tool, but they have limits. — Chris, Jan 08 '23 at 22:44
@Evg It is technically just a regex question. 1. JavaScript tag: the regex uses JavaScript formatting, not PCRE (Perl Compatible Regular Expressions). 2. C++ tag: I am using Boost.Regex, a C++ library, for my project. Answers that address my desired Boost.Regex solution are appreciated, but again not necessary to answer the above. — Warren Niles, Jan 08 '23 at 23:36

sehe · Accepted Answer · 2023-01-08T23:42:26.957

I would 100% not use regular expressions for this. First off, it's way easier to express as a PEG grammar instead, e.g. with Boost.Spirit X3:

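#include <boost/spirit/home/x3.hpp>
#include <string>
#include <string_view>
#include <vector>

std::vector<std::string> tokens(std::string_view input) {
    namespace x3 = boost::spirit::x3;
    std::vector<std::string> r;

    // An atom is a bracketed, braced or quoted group, or any single visible character
    auto atom                            //
        = '[' >> *~x3::char_(']') >> ']' //
        | '{' >> *~x3::char_('}') >> '}' //
        | '"' >> *~x3::char_('"') >> '"' //
        | x3::graph;

    // A token is a run of atoms, captured as the raw matched text
    auto token = x3::raw[*atom];

    // Tokens are separated by one or more whitespace characters
    parse(input.begin(), input.end(), token % +x3::space, r);
    return r;
}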

This, off the bat, already performs as you intend:

#include <iomanip>
#include <iostream>

// Uses tokens() as defined above
int main() {
    for (std::string const input : {R"(" Test Test " ab c " Test" "Test " "Test" "T e s t")"}) {
        std::cout << input << "\n";
        for (auto& tok : tokens(input))
            std::cout << " - " << quoted(tok, '\'') << "\n";
    }
}

Output:

" Test Test " ab c " Test" "Test " "Test" "T e s t"
 - '" Test Test "'
 - 'ab'
 - 'c'
 - '" Test"'
 - '"Test "'
 - '"Test"'
 - '"T e s t"'
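For readers following along in JavaScript, the shape of the grammar above can be approximated with a regex that matches tokens instead of splitting on delimiters. This is only a rough sketch, not the Spirit implementation: the names `TOKEN` and `tokens` are made up for illustration, and regex backtracking is not the same as PEG ordered choice, so the two may disagree on unbalanced input.

```javascript
// Hypothetical JavaScript approximation of the X3 grammar above.
// atom  = '[' ... ']' | '{' ... '}' | '"' ... '"' | any single non-space character
// token = one or more atoms
const TOKEN = /(?:\[[^\]]*\]|\{[^}]*\}|"[^"]*"|\S)+/g;

function tokens(input) {
  // match() returns null when nothing matches, so fall back to an empty array.
  return input.match(TOKEN) ?? [];
}

const sample = `" Test Test " ab c " Test" "Test " "Test" "T e s t"`;
for (const tok of tokens(sample)) {
  console.log(` - '${tok}'`);
}
```

Like the grammar, this treats a bracketed, braced or quoted group as a unit only when it is closed, and it does not handle nesting.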

BONUS

Where this really makes a difference is when you realize that you want to handle nested constructs (e.g. "string" [ {1,2,"3,4", [true,"more [string]"], 9 }, "bye ]).

Regular expressions are notoriously bad at this. Spirit grammar rules can be recursive though. If you make your grammar description more explicit I could show you examples.
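To illustrate why nesting calls for state rather than a flat pattern, here is a minimal JavaScript sketch of a hand-written scanner that splits on whitespace only at nesting depth zero and outside quotes. It is not the recursive Spirit grammar mentioned above. The function name `splitTopLevel` is hypothetical, the sample input is a balanced variant of the example (with a closing quote after `bye` added as an assumption), and the code assumes balanced brackets and no escaped quotes.

```javascript
// Hypothetical sketch: whitespace splits tokens only at depth 0 and outside quotes.
// Assumes balanced brackets and no escaped quotes; it does not validate input.
const CLOSERS = new Map([["[", "]"], ["{", "}"]]);

function splitTopLevel(input) {
  const stack = [];      // closing brackets we are still waiting for
  const result = [];
  let inQuotes = false;
  let current = "";

  for (const ch of input) {
    if (inQuotes) {
      current += ch;
      if (ch === '"') inQuotes = false;      // closing quote
    } else if (ch === '"') {
      inQuotes = true;                        // opening quote
      current += ch;
    } else if (CLOSERS.has(ch)) {
      stack.push(CLOSERS.get(ch));            // one level deeper
      current += ch;
    } else if (stack.length > 0 && ch === stack[stack.length - 1]) {
      stack.pop();                            // one level back up
      current += ch;
    } else if (/\s/.test(ch) && stack.length === 0) {
      if (current !== "") result.push(current);   // token boundary
      current = "";
    } else {
      current += ch;
    }
  }
  if (current !== "") result.push(current);
  return result;
}

console.log(splitTopLevel(`"string" [ {1,2,"3,4", [true,"more [string]"], 9 }, "bye" ] tail`));
```

The explicit stack is what a single flat regex lacks: the scanner has to remember how many groups are open, and of which kind, to know whether a space is a delimiter.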

Thank you, even though it's not a popularity contest. I hope this helps you solve your problem (or realize the actual problem - heed the second part of the answer!) — sehe, Jan 08 '23 at 23:43

Peter Thoeny · Answer 2 · 2023-01-08T23:19:15.980

You can use multiple regexes if you are ok with that. The idea is to replace spaces inside quotes with a non-printable char (\x01), and restore them after the split:

const input = `" Test Test " ab c " Test" "Test " "Test" "T e s t"`;
let result = input
  .replace(/"[^"]*"/g, m => m.replace(/ /g, '\x01')) // replace spaces inside quotes
  .split(/ +/) // split on spaces
  .map(s => s.replace(/\x01/g, ' ')); // restore spaces inside quotes
console.log(result);

If you have escaped quotes within a string, such as "a \"quoted\" token" you can use this regex instead:

const input = String.raw`"A \"quoted\" token" " Test Test " ab c " Test" "Test " "Test" "T e s t"`; // String.raw keeps the backslashes; a plain template literal would drop them
let result = input
  .replace(/".*?[^\\]"/g, m => m.replace(/ /g, '\x01')) // replace spaces inside quotes
  .split(/ +/) // split on spaces
  .map(s => s.replace(/\x01/g, ' ')); // restore spaces inside quotes
console.log(result);
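If a single pattern is required, as the online tester in the question implies, a different technique from the placeholder approach above is a lookahead that accepts whitespace only when an even number of quotes remains after it. The sketch below is an illustration under assumptions: the names `OUTSIDE_QUOTES` and `splitOutsideQuotes` are hypothetical, quotes must be balanced and unescaped, brackets and braces are not handled, and it has not been tested against Boost.Regex.

```javascript
// Whitespace followed by an even number of double quotes up to the end of the
// string must lie outside any quoted section (assuming balanced, unescaped quotes).
const OUTSIDE_QUOTES = /\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/;

function splitOutsideQuotes(input) {
  return input.split(OUTSIDE_QUOTES);
}

const text = `" Test Test " ab c " Test" "Test " "Test" "T e s t"`;
console.log(splitOutsideQuotes(text));
```

The lookahead rescans the rest of the string for every candidate space, so this can be slow on long inputs, and leading or trailing whitespace produces empty strings in the result.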

If you want to parse nested brackets, you need a proper language parser. You can, however, also do that with regexes; see: Parsing JavaScript objects with functions as JSON

Learn more about regex: https://twiki.org/cgi-bin/view/Codev/TWikiPresentation2018x10x14Regex
