Split a line using std::regex and discard empty elements

Question

I need to split a line on two separators: ' ' and ';'.

For example:

input : " abc  ; def  hij  klm  "
output: {"abc","def","hij","klm"}

How can I fix the function below to discard the first empty element?

std::vector<std::string> Split(std::string const& line) {
    std::regex seps("[ ;]+");
    std::sregex_token_iterator rit(line.begin(), line.end(), seps, -1);
    return std::vector<std::string>(rit, std::sregex_token_iterator());
}

// input : " abc  ; def  hij  klm  "
// output: {"","abc","def","hij","klm"}

Below is a complete sample that compiles:

#include <iostream>
#include <string>
#include <vector>
#include <regex>

std::vector<std::string> Split(std::string const& line) {
    std::regex seps("[ ;]+");
    std::sregex_token_iterator rit(line.begin(), line.end(), seps, -1);
    return std::vector<std::string>(rit, std::sregex_token_iterator());
}

int main()
{
    std::string line = " abc  ; def  hij  klm  ";
    std::cout << "input: \"" << line << "\"" << std::endl;

    auto collection = Split(line);

    std::cout << "output: {";
    auto bComma = false;
    for (auto const& oneField : collection)
    {
        std::cout << (bComma ? "," : "") << "\"" << oneField << "\"";
        bComma = true;
    }
    std::cout << "} " << std::endl;
}

Jerry Coffin · Accepted Answer · 2017-05-18T23:57:28.640

I can see a couple of possibilities beyond what's been mentioned in the other answers so far. The first would be to use std::remove_copy_if when building your vector:

// regex stuff here
std::vector<std::string> tokens;
std::remove_copy_if(rit, std::sregex_token_iterator(), 
                    std::back_inserter(tokens),
                    [](std::string const &s) { return s.empty(); });

Another possibility would be to create a locale that classified characters appropriately, and just read from there:

struct reader: std::ctype<char> {
    reader(): std::ctype<char>(get_table()) {}
    static std::ctype_base::mask const* get_table() {
        static std::vector<std::ctype_base::mask> rc(table_size, std::ctype_base::mask());

        rc[' '] = std::ctype_base::space;
        rc[';'] = std::ctype_base::space;

        // at a guess, newlines are probably still separators too:
        rc['\n'] = std::ctype_base::space;
        return &rc[0];
    }
};

Once we have this, we tell the stream to use that locale when reading from (or writing to) the stream:

std::stringstream input(" abc  ; def  hij  klm  ");

input.imbue(std::locale(std::locale(), new reader));

Then we probably want to clean up the code for inserting commas only between tokens, rather than after every token. Fortunately, I wrote some code to handle that fairly neatly some time ago (infix_ostream_iterator, which is not part of the standard library). Using it, we can copy tokens from the input above to standard output fairly simply:

std::cout << "{ ";
std::copy(std::istream_iterator<std::string>(input), {}, 
    infix_ostream_iterator<std::string>(std::cout, ", "));  
std::cout << " }";

Result: "{ abc, def, hij, klm }", exactly as you'd expect/hope for--without any extra kludges to make up for its starting out doing the wrong thing.
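
For anyone who wants the tokens in a vector rather than on standard output, here is a minimal self-contained sketch of the same locale idea. The names separator_ctype and SplitWithFacet are made up for this illustration, and it sidesteps infix_ostream_iterator entirely by collecting the tokens with std::istream_iterator:

```cpp
#include <iterator>
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical facet name: treats ' ', ';' and '\n' as whitespace, nothing else.
struct separator_ctype : std::ctype<char> {
    separator_ctype() : std::ctype<char>(get_table()) {}

    static std::ctype_base::mask const* get_table() {
        // table_size entries, all starting with an empty classification.
        static std::vector<std::ctype_base::mask> table(table_size,
                                                        std::ctype_base::mask());
        table[' ']  = std::ctype_base::space;
        table[';']  = std::ctype_base::space;
        table['\n'] = std::ctype_base::space;
        return table.data();
    }
};

// Hypothetical helper: operator>> skips whatever the facet calls whitespace,
// so runs of separators never produce empty tokens.
std::vector<std::string> SplitWithFacet(std::string const& line) {
    std::istringstream input(line);
    // The locale takes ownership of the facet, so no delete is needed.
    input.imbue(std::locale(std::locale(), new separator_ctype));
    return std::vector<std::string>(std::istream_iterator<std::string>(input),
                                    std::istream_iterator<std::string>());
}
```

Calling SplitWithFacet(" abc  ; def  hij  klm  ") should give the four tokens abc, def, hij and klm. Note that because the table starts out empty, characters such as tabs are no longer treated as whitespace unless you add them.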

I got the compilation error: no instance of overloaded function "std::remove_copy_if" matches the argument list. Here is my code: std::vector<std::string> SplitLine(std::string const& line, const std::regex seps) { std::sregex_token_iterator rit(line.begin(), line.end(), seps, -1); std::vector<std::string> tokens; std::remove_copy_if(rit, std::sregex_token_iterator(), [](std::string const &s) { return s.empty(); }); return tokens;} — Less White, May 18 '17 at 23:51
@LessWhite: Oops--I seem to have left out the third iterator to tell `remove_copy_if` where it should write its result. I've edited. — Jerry Coffin, May 18 '17 at 23:56
Got it. The use of std::remove_copy_if looks simpler to me. Thanks a lot. — Less White, May 19 '17 at 00:14

score 2 · Answer 2 · answered May 18 '17 at 19:52

You could always add an extra step at the end of the function to prune out the empty strings altogether, using the erase-remove idiom:

std::vector<std::string> Split(std::string const& line) {
    std::regex seps("[ ;]+");
    std::sregex_token_iterator rit(line.begin(), line.end(), seps, -1);
    auto tokens = std::vector<std::string>(rit, std::sregex_token_iterator());
    tokens.erase(std::remove_if(tokens.begin(),
                                tokens.end(),
                                [](std::string const& s){ return s.empty(); }),
                 tokens.end());
    return tokens;
}

score 1 · Answer 3 · answered May 18 '17 at 19:58

If you do not want to remove elements from the vector after populating it, you can instead traverse the iterator range and build the vector, skipping the empty matches:

std::vector<std::string> Split(std::string const& line) {
    std::regex seps("[ ;]+");
    std::sregex_token_iterator rit(line.begin(), line.end(), seps, -1), end;
    std::vector<std::string> tokens;
    // Copy only the non-empty tokens into the result.
    for (; rit != end; ++rit)
        if (rit->length() != 0)
            tokens.push_back(*rit);
    return tokens;
}
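
A related variation, not taken from any of the answers here: instead of matching the separators and filtering out empty pieces, match the tokens themselves. This is only a sketch, and the name SplitByTokens is made up for the example. It assumes a token is any run of characters that are neither ' ' nor ';':

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: matches tokens rather than separators.
std::vector<std::string> SplitByTokens(std::string const& line) {
    // One or more characters that are neither a space nor a semicolon.
    static std::regex const token("[^ ;]+");
    // Submatch 0 (the default) yields each full match; a match of "[^ ;]+"
    // is never empty, so there is nothing to discard afterwards.
    std::sregex_token_iterator first(line.begin(), line.end(), token), last;
    return std::vector<std::string>(first, last);
}
```

With the input from the question, SplitByTokens(" abc  ; def  hij  klm  ") should return the four tokens abc, def, hij and klm. The trade-off is that the regex now describes what a token looks like rather than what a separator looks like, which may or may not be the more natural way to state your rule.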

Thanks for the alternative and the clarification of my question! — Less White, May 18 '17 at 20:21

score 0 · Answer 4 · answered May 19 '17 at 00:18

In case someone wants to copy it, here is the function revised based on Jerry Coffin's input, using std::remove_copy_if:

std::vector<std::string> SplitLine(std::string const& line, std::regex const& seps)
{
    std::sregex_token_iterator rit(line.begin(), line.end(), seps, -1);
    std::vector<std::string> tokens;
    std::remove_copy_if(rit, std::sregex_token_iterator(),
        std::back_inserter(tokens),
        [](std::string const &s) { return s.empty(); });
    return tokens;
}
