How to split a sentence with an escaped whitespace?

Question

I want to split my sentence using whitespace as my delimiter except for escaped whitespaces. Using boost::split and regex, how can I split it? If not possible, how else?

Example:

std::string sentence = "My dog Fluffy\\ Cake likes to jump";

Result:
My
dog
Fluffy\ Cake
likes
to
jump

You can do it with std::stringstream http://stackoverflow.com/a/236803/4603670 or regex http://www.regexr.com/ — Barmak Shemirani, Apr 01 '15 at 06:09
@BarmakShemirani and how would you handle the escaped space? — sehe, Apr 01 '15 at 06:36
@sehe, you may use Boost Spirit, Boost Regex, or Handwritten parser. — Barmak Shemirani, Apr 01 '15 at 13:22

sehe · Accepted Answer · 2015-04-01T14:20:26.057

Three implementations:

With Boost Spirit
With Boost Regex
Handwritten parser

With Boost Spirit

Here's how I'd do this with Boost Spirit. This might seem overkill, but experience teaches me that once you're splitting input text you will likely require more parsing logic.

Boost Spirit shines when you scale from "just splitting tokens" to a real grammar with production rules.


#include <boost/spirit/include/qi.hpp>
#include <iostream>
#include <string>
#include <vector>
namespace qi = boost::spirit::qi;

int main() {
    std::string const sentence = "My dog Fluffy\\ Cake likes to jump";
    using It = std::string::const_iterator;
    It f = sentence.begin(), l = sentence.end();

    std::vector<std::string> words;

    bool ok = qi::phrase_parse(f, l,
            *qi::lexeme [ +('\\' >> qi::char_ | qi::graph) ], // words
            qi::space - "\\ ", // skipper
            words);

    if (ok) {
        std::cout << "Parsed:\n";
        for (auto& w : words)
            std::cout << "\t'" << w << "'\n";
    } else {
        std::cout << "Parse failed\n";
    }

    if (f != l)
        std::cout << "Remaining unparsed: '" << std::string(f,l) << "'\n";
}

With Boost Regex

This looks really succinct, but it

requires linking to boost_regex
uses a "black magic" negative lookbehind assertion: http://www.regular-expressions.info/lookaround.html


#include <iostream>
#include <boost/regex.hpp>
#include <boost/algorithm/string_regex.hpp>
#include <string>
#include <vector>

int main() {
    std::string const sentence = "My dog Fluffy\\ Cake likes to jump";

    std::vector<std::string> words;
    boost::algorithm::split_regex(words, sentence, boost::regex("(?<!\\\\)\\s"), boost::match_default);

    for (auto& w : words)
        std::cout << " '" << w << "'\n";
}

Using C++11 raw string literals, you could write the regular expression slightly less obscurely: boost::regex(R"((?<!\\)\s)"), meaning "any whitespace character not preceded by a backslash". Note that the backslash itself stays in the resulting token.
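
If linking to Boost.Regex is not an option, keep in mind that std::regex with its default ECMAScript grammar has no lookbehind, so the pattern above cannot be reused as is. A minimal sketch of a different technique, matching the tokens instead of the delimiters, using only the standard library; split_escaped is a hypothetical name, not part of Boost or of the code above. As with the Boost Regex version, the backslash stays in the token.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect tokens instead of splitting on delimiters.
// A token is one or more of either
//   - a backslash followed by any character (an escape), or
//   - a single non-whitespace character.
std::vector<std::string> split_escaped(std::string const& sentence) {
    static std::regex const token(R"((?:\\.|\S)+)");

    std::vector<std::string> words;
    for (std::sregex_iterator it(sentence.begin(), sentence.end(), token), end;
         it != end; ++it)
        words.push_back(it->str()); // the escape is kept, e.g. "Fluffy\ Cake"
    return words;
}
```

Calling split_escaped on the example sentence should give six tokens, the third being the escaped pair as one word.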

Handwritten parser

This is somewhat more tedious but, like the Spirit grammar, it is completely generic and allows good performance.

However, it doesn't scale nearly as gracefully as the Spirit approach once you start adding complexity to your grammar. An advantage is that you spend less time compiling the code than with the Spirit version.


#include <iostream>
#include <iterator>
#include <string>
#include <vector>

template <typename It, typename Out>
Out tokens(It f, It l, Out out) {
    std::string accum;
    auto flush = [&] { 
        if (!accum.empty()) {
            *out++ = accum;
            accum.resize(0);
        }
    };

    while (f!=l) {
        switch(*f) {
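            case '\\':
                // "\ " is an escaped space: append it and consume it, so the
                // whitespace case below does not end the word
                if (++f!=l && *f==' ')
                    accum += *f++;
                else
                    accum += '\\';
                break;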
            case ' ': case '\t': case '\r': case '\n':
                ++f;
                flush();
                break;
            default:
                accum += *f++;
        }
    }
    flush();
    return out;
}

int main() {
    std::string const sentence = "My dog Fluffy\\ Cake likes to jump";

    std::vector<std::string> words;

    tokens(sentence.begin(), sentence.end(), std::back_inserter(words));

    for (auto& w : words)
        std::cout << "\t'" << w << "'\n";
}

I used the boost regex one you provided, and it works perfectly. Thanks a plenty. — AppleJuice, Apr 01 '15 at 20:29
@AppleJuice you realize that you chose the ugly stepchild, right? :) The only one that comes with link dependencies, requires exemptions in your life insurances, and requires you to manually remove the escape even after it was parsed :) (luckily enough it doesn't require a virgin sacrifice to compile, like #1; and #3 induces C envy). Cheers — sehe, Apr 01 '15 at 20:32
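
The manual removal of the escape mentioned in the comment above could look roughly like this. It is a minimal sketch using only the standard library; unescape_spaces is a hypothetical name, and it assumes that only a backslash directly followed by a space counts as an escape.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: drop the backslash from every "\ " pair and
// leave all other backslashes untouched.
std::string unescape_spaces(std::string const& word) {
    std::string out;
    out.reserve(word.size());
    for (std::size_t i = 0; i < word.size(); ++i) {
        if (word[i] == '\\' && i + 1 < word.size() && word[i + 1] == ' ')
            continue; // skip the backslash; the space is copied next iteration
        out += word[i];
    }
    return out;
}
```

Applied to each element of the words vector after the regex split, this turns the escaped token into the two-word name with a plain space.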
