Splitting string with multiple delimiters, allowing quoted values

Question

The docs for boost::escaped_list_separator provide the following explanation for the second parameter c:

Any character in the string c, is considered to be a separator.

So, I need to split a string on multiple separators while allowing quoted values, which may themselves contain those separators:

#include <iostream>
#include <string>

#include <boost/tokenizer.hpp>

int main() {
    std::wstring str = L"2   , 14   33  50   \"AAA BBB\"";

    std::wstring escSep(L"\\"); //escape character
    std::wstring delim(L" \t\r\n,"); //split on spaces, tabs, new lines, commas
    std::wstring quotes(L"\""); //allow double-quoted values with delimiters within

    boost::escaped_list_separator<wchar_t> separator(escSep, delim, quotes);
    boost::tokenizer<boost::escaped_list_separator<wchar_t>, std::wstring::const_iterator, std::wstring> tok(str, separator);

    for(auto beg=tok.begin(); beg!=tok.end();++beg)
        std::wcout << *beg << std::endl;

    return 0;
}

The expected result would be [2; 14; 33; 50; AAA BBB]. However, this code results in a bunch of empty tokens.

Regular boost::char_separator omits all these empty tokens, considering all delimiters. It seems that boost::escaped_list_separator also considers all specified delimiters, but produces empty values. Is it true that if multiple consecutive delimiters are encountered, it will produce empty tokens? Is there any way to avoid this?

If the only side effect is these empty tokens, it's easy to test the resulting values and skip them manually. But that can get pretty ugly. For example, imagine strings that each hold 2 actual values, possibly separated by many tabs AND spaces. Specifying the delimiters as L"\t " (i.e. space and tab) will work, but it will produce a ton of empty tokens.

What you want is more like parsing than tokenizing. At the very least you need stateful scanning - which makes it unlike splitting. I'd always use a parser generator approach here. I have many many examples for that on this site (see e.g. https://stackoverflow.com/questions/10289985/parse-quoted-strings-with-boostspirit/10294577#10294577). — sehe, Apr 28 '17 at 09:40

sigbjornlo · Accepted Answer · 2017-04-30T10:25:57.637

Judging by the Boost Tokenizer documentation, you are indeed correct in assuming that boost::escaped_list_separator produces empty tokens when it encounters multiple consecutive delimiters. Unlike boost::char_separator, boost::escaped_list_separator does not provide any constructor that lets you choose whether to keep or discard the empty tokens it produces.

While having the option to discard empty tokens can be nice, when you consider the use case (parsing CSV files) presented in the documentation (http://www.boost.org/doc/libs/1_64_0/libs/tokenizer/escaped_list_separator.htm), keeping empty tokens makes perfect sense. An empty field is still a field.

One option is to simply discard empty tokens after tokenizing. If the generation of empty tokens concerns you, an alternative is to remove repeated delimiters from the string before passing it to the tokenizer, but obviously you will need to take care not to remove anything inside quotes.
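The second option could look roughly like the sketch below. collapse_delimiters is a hypothetical helper, not something Boost provides, and it makes a few simplifying assumptions: a single quote character, a single escape character, and every run of delimiters outside quotes is replaced by the first character of the delimiter set. Runs at the very start or end of the string are dropped entirely, since they would otherwise still produce an empty token.

```cpp
#include <string>

// Hypothetical helper: collapse each run of delimiters that lies outside quotes
// into one delimiter, and drop leading and trailing runs. Text inside quotes and
// escaped characters are copied unchanged.
std::wstring collapse_delimiters(const std::wstring& input,
                                 const std::wstring& delims = L" \t\r\n,",
                                 wchar_t quote = L'"',
                                 wchar_t escape = L'\\') {
    std::wstring out;
    bool in_quotes = false;
    bool pending_delim = false; // a delimiter run was seen but not yet written

    for (std::wstring::size_type i = 0; i < input.size(); ++i) {
        const wchar_t ch = input[i];

        if (!in_quotes && delims.find(ch) != std::wstring::npos) {
            pending_delim = !out.empty(); // ignore runs before the first value
            continue;
        }
        if (pending_delim) {
            out += delims[0]; // one delimiter stands in for the whole run
            pending_delim = false;
        }
        if (ch == escape && i + 1 < input.size()) {
            out += ch;
            out += input[++i]; // keep the escaped character as it is
            continue;
        }
        if (ch == quote)
            in_quotes = !in_quotes;
        out += ch;
    }
    return out; // a pending run at the end is never written
}
```

The returned string can then be handed to the same boost::tokenizer setup as in the question. Because a mixed run such as " , " collapses to a single delimiter, this only fits input where an empty field carries no meaning.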
