
I want to split my sentence using whitespace as my delimiter except for escaped whitespaces. Using boost::split and regex, how can I split it? If not possible, how else?

Example:

std::string sentence = "My dog Fluffy\\ Cake likes to jump";

Result:
My
dog
Fluffy\ Cake
likes
to
jump

AppleJuice

1 Answer


Three implementations:

  1. With Boost Spirit
  2. With Boost Regex
  3. Handwritten parser

With Boost Spirit

Here's how I'd do this with Boost Spirit. This might seem overkill, but experience teaches me that once you're splitting input text you will likely require more parsing logic.

Boost Spirit shines when you scale from "just splitting tokens" to a real grammar with production rules.

#include <boost/spirit/include/qi.hpp>
#include <iostream>
#include <string>
#include <vector>
namespace qi = boost::spirit::qi;

int main() {
    std::string const sentence = "My dog Fluffy\\ Cake likes to jump";
    using It = std::string::const_iterator;
    It f = sentence.begin(), l = sentence.end();

    std::vector<std::string> words;

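    // lexeme: no skipping inside a word; '\\' >> qi::char_ drops the backslash
    // and keeps the character after it, so an escaped space stays in the word
    bool ok = qi::phrase_parse(f, l,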
            *qi::lexeme [ +('\\' >> qi::char_ | qi::graph) ], // words
            qi::space - "\\ ", // skipper
            words);

    if (ok) {
        std::cout << "Parsed:\n";
        for (auto& w : words)
            std::cout << "\t'" << w << "'\n";
    } else {
        std::cout << "Parse failed\n";
    }

    if (f != l)
        std::cout << "Remaining unparsed: '" << std::string(f,l) << "'\n";
}

With Boost Regex

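This looks really succinct, but the escaping backslash stays in the token (Fluffy\ Cake), so you still have to remove it yourself afterwards.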

#include <boost/algorithm/string_regex.hpp>
#include <boost/regex.hpp>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::string const sentence = "My dog Fluffy\\ Cake likes to jump";

    std::vector<std::string> words;
    boost::algorithm::split_regex(words, sentence, boost::regex("(?<!\\\\)\\s"), boost::match_default);

    for (auto& w : words)
        std::cout << " '" << w << "'\n";
}

Using C++11 raw string literals you can write the regular expression slightly less obscurely: boost::regex(R"((?<!\\)\s)"), meaning "any whitespace character that does not follow a backslash".
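Because the lookbehind only decides where to cut, each token still contains its escape, e.g. Fluffy\ Cake. If you want the plain text, a small post-processing pass is enough. The sketch below is standard library only; unescape_spaces and clean_tokens are hypothetical helper names, not part of Boost. It assumes that only "backslash followed by a space" counts as an escape and that every other backslash should be kept as-is.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of Boost): turn each "\ " into " ".
// Any other backslash, including a trailing one, is left untouched.
std::string unescape_spaces(std::string const& token) {
    std::string out;
    out.reserve(token.size());
    for (std::size_t i = 0; i < token.size(); ++i) {
        if (token[i] == '\\' && i + 1 < token.size() && token[i + 1] == ' ')
            continue; // drop the backslash; the space is copied next iteration
        out += token[i];
    }
    return out;
}

// Hypothetical helper: unescape every token and skip empty ones, which a
// delimiter-based split can produce when two delimiters are adjacent.
std::vector<std::string> clean_tokens(std::vector<std::string> const& raw) {
    std::vector<std::string> result;
    for (auto const& t : raw) {
        if (!t.empty())
            result.push_back(unescape_spaces(t));
    }
    return result;
}
```

You would call clean_tokens(words) on the vector filled by split_regex and print the returned vector instead.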

Handwritten parser

This is somewhat more tedious, but like the Spirit grammar it is completely generic, and it allows nice performance.

However, it doesn't scale nearly as gracefully as the Spirit approach once you start adding complexity to your grammar. An advantage is that you spend less time compiling the code than with the Spirit version.

#include <iostream>
#include <iterator>
#include <string>
#include <vector>

template <typename It, typename Out>
Out tokens(It f, It l, Out out) {
    std::string accum;
    auto flush = [&] { 
        if (!accum.empty()) {
            *out++ = accum;
            accum.resize(0);
        }
    };

    while (f!=l) {
        switch(*f) {
            case '\\': 
                if (++f!=l && *f==' ')
                    accum += *f++; // escaped space: keep it and step past it
                else
                    accum += '\\'; // any other backslash is kept literally
                break;
            case ' ': case '\t': case '\r': case '\n':
                ++f;
                flush();
                break;
            default:
                accum += *f++;
        }
    }
    flush();
    return out;
}

int main() {
    std::string const sentence = "My dog Fluffy\\ Cake likes to jump";

    std::vector<std::string> words;

    tokens(sentence.begin(), sentence.end(), std::back_inserter(words));

    for (auto& w : words)
        std::cout << "\t'" << w << "'\n";
}
sehe
  • I used the boost regex one you provided, and it works perfectly. Thanks a plenty. – AppleJuice Apr 01 '15 at 20:29
  • @AppleJuice you realize that you chose the ugly stepchild right :) The only one that comes with link dependencies, requires exemptions in your life insurances, and requires you to manually remove the escape even after it was parsed :) (luckily enough it doesn't require a virgin sacrifice to compile, like #1; and #3 induces [tag:C] envy). Cheers – sehe Apr 01 '15 at 20:32