Extracting tokens from string in C++

Question

Edit: I'm looking for a solution that doesn't use regex, since std::regex seems buggy and untrustworthy.

I had the following function, which extracts tokens from a string whenever one of the following symbols is found: +, -, ^, *, !

bool extract_tokens(std::string expression, std::vector<std::string> &tokens) {
    // Each match is either a single delimiter or a run of word/space characters.
    static const std::regex reg(R"(\+|\^|-|\*|!|\(|\)|([\w|\s]+))");
    std::copy(std::sregex_token_iterator(expression.begin(), expression.end(), reg, 0),
              std::sregex_token_iterator(),
              std::back_inserter(tokens));
    return true;
}

I thought it worked perfectly until today, when I found an edge case. The input !aaa + ! a is supposed to return !,aaa ,+,!, a but it returns !,aaa ,+,"",!, a. Notice the extra empty string between + and !.

How can I prevent this behaviour? I think this can be done within the regular expression.

Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/219474/discussion-on-question-by-dure-extracting-tokens-from-string-in-c). — Samuel Liew, Aug 09 '20 at 03:37

Thomas Sablik · Answer 1 · 2020-08-08T21:43:56.853

0

Inspired by https://stackoverflow.com/a/9436872/4645334 you could solve the problem with:

bool extract_tokens(std::string expression, std::vector<std::string> &tokens) {
  std::string token;

  for (const auto& c: expression) {
    if (c == '^' || c == '-' || c == '*' || c == '+' || c == '!') {
      if (token.length() && !std::all_of(token.cbegin(), token.cend(), [](auto c) { return c == ' '; })) tokens.push_back(token);
      token.clear();
      tokens.emplace_back(1, c);
    } else {
      token += c;
    }
  }

  if (token.length() && !std::all_of(token.cbegin(), token.cend(), [](auto c) { return c == ' '; })) tokens.push_back(token);
     
  return true;
}

Input:

"!aaa + ! a"

Output:

"!","aaa ","+","!"," a"

edited Aug 08 '20 at 21:43

answered Aug 08 '20 at 20:53

Thomas Sablik


`std::string(c, 1)` wouldn't compile. You probably meant `tokens.push_back(c)`. Anyway, this has the same problem as the original question - it produces tokens that consist of whitespace alone. The OP wants `! +` to be split into two tokens `!` and `+`, not three tokens. – Igor Tandetnik Aug 08 '20 at 21:01
@IgorTandetnik `c` is a char. `tokens.push_back(c)` wouldn't compile: https://wandbox.org/permlink/cx8csmVomCMroaiy – Thomas Sablik Aug 08 '20 at 21:04
Ah, sorry. I confused `token` and `tokens`. Still, to construct a string of one character, it should be `std::string(1, c)`. You have your arguments the wrong way round. – Igor Tandetnik Aug 08 '20 at 21:07

score 0 · Accepted Answer · answered Aug 08 '20 at 21:16

In an attempt to salvage the regular expression-based solution, I came up with this:

[-+^*!()]|\s*[^-+^*!()\s][^-+^*!()]*

This reports delimiters and anything between delimiters, including leading and trailing whitespace, but drops tokens consisting of whitespace alone.
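For reference, here is a minimal sketch of how that pattern could be plugged into the same std::sregex_token_iterator approach as in the question. The function name tokenize_keep_spaces is made up for illustration, and it returns the vector rather than filling an out-parameter; neither detail comes from the original code.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (name is an assumption, not from the question).
// Each match is either a single delimiter, or a run of non-delimiter
// characters that contains at least one non-whitespace character.
std::vector<std::string> tokenize_keep_spaces(const std::string& expression) {
    static const std::regex reg(R"([-+^*!()]|\s*[^-+^*!()\s][^-+^*!()]*)");
    // Submatch index 0 selects the whole match for every token.
    return std::vector<std::string>(
        std::sregex_token_iterator(expression.begin(), expression.end(), reg, 0),
        std::sregex_token_iterator());
}
```

With the input !aaa + ! a this should yield !,aaa ,+,!, a because the lone space between + and ! cannot satisfy the second alternative and is skipped by the search.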

A similar expression that also strips leading and trailing whitespace:

[-+^*!()]|[^-+^*!()\s]+(\s+[^-+^*!()\s]+)*
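A similar sketch for the whitespace-stripping pattern, written with balanced parentheses. Again, the function name tokenize_trimmed is only an assumption for illustration; the iterator usage mirrors the code in the question.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (name is an assumption, not from the question).
// Each match is either a single delimiter, or one or more "words"
// separated by inner whitespace, so surrounding whitespace is never
// part of a token.
std::vector<std::string> tokenize_trimmed(const std::string& expression) {
    static const std::regex reg(
        R"([-+^*!()]|[^-+^*!()\s]+(\s+[^-+^*!()\s]+)*)");
    // Submatch index 0 selects the whole match, so the capture group
    // only serves to group the repetition.
    return std::vector<std::string>(
        std::sregex_token_iterator(expression.begin(), expression.end(), reg, 0),
        std::sregex_token_iterator());
}
```

With the input !aaa + ! a this should yield !,aaa,+,!,a while an operand with inner spaces, such as a b in a b + c, stays a single token.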

