42

I have some text (meaningful text or arithmetical expression) and I want to split it into words.
If I had a single delimiter, I'd use:

std::stringstream stringStream(inputString);
std::string word;
while(std::getline(stringStream, word, delimiter)) 
{
    wordVector.push_back(word);
}

How can I break the string into tokens with several delimiters?

Baum mit Augen
Sergei G
  • Boost.StringAlgorithm or Boost.Tokenizer would help. – K-ballo Oct 01 '11 at 17:14
  • Or, some idea you can get from this answer : http://stackoverflow.com/questions/4888879/elegant-ways-to-count-the-frequency-of-words-in-a-file – Nawaz Oct 01 '11 at 17:17
  • 3
    @K-ballo: According to the question, you should not use external libraries like Boost. – masoud Oct 01 '11 at 17:17
  • 1
    @MasoudM.: Does Boost still count as an external library ;) ? As far as I am concerned, Boost is like my Standard Library, it's built-in! – Matthieu M. Oct 01 '11 at 17:30
  • 1
    @MatthieuM.: Then, Qt is not external library for me too. – masoud Oct 01 '11 at 17:51
  • Thanks for the link on frequency counting! :) – Sergei G Oct 01 '11 at 18:28
  • @MasoudM.: To each its own :) The one key difference though is that a number of Boost libraries are experimentation before inclusion in the Standard (Boost.Regex, Boost.Thread and Boost.Unordered have been included with few tweaks in C++11, Boost.FS is the basis for the filesystem reflexion for C++1x). – Matthieu M. Oct 02 '11 at 09:56

7 Answers

56

Assuming one of the delimiters is newline, the following reads the line and further splits it by the delimiters. For this example I've chosen the delimiters space, apostrophe, and semi-colon.

std::stringstream stringStream(inputString);
std::string line;
while(std::getline(stringStream, line)) 
{
    std::size_t prev = 0, pos;
    while ((pos = line.find_first_of(" ';", prev)) != std::string::npos)
    {
        if (pos > prev)
            wordVector.push_back(line.substr(prev, pos-prev));
        prev = pos+1;
    }
    if (prev < line.length())
        wordVector.push_back(line.substr(prev, std::string::npos));
}
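
For reuse, the inner loop above could be wrapped in a function. The following is only a sketch: the name splitAny and its signature are hypothetical, and the delimiter set is passed in rather than hard-coded, so a newline can be listed like any other delimiter.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of the answer above): splits `input` on any
// character found in `delims` and skips the empty tokens that consecutive
// delimiters would otherwise produce.
std::vector<std::string> splitAny(const std::string& input, const std::string& delims)
{
    std::vector<std::string> words;
    std::size_t prev = 0, pos;
    while ((pos = input.find_first_of(delims, prev)) != std::string::npos)
    {
        if (pos > prev)                       // non-empty text before the delimiter
            words.push_back(input.substr(prev, pos - prev));
        prev = pos + 1;                       // continue after the delimiter
    }
    if (prev < input.length())                // trailing word without a delimiter
        words.push_back(input.substr(prev));
    return words;
}
```

Called as splitAny(inputString, " ';\n"), this should yield the same words as the two nested loops, without needing the stringstream.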
SoapBox
  • 4
    You're too fast for me :p If newline is not a delimiter, then simply picking one of the "regular" delimiters (and removing it from the inner loop) will work. – Matthieu M. Oct 01 '11 at 17:32
  • This is the most reasonable way to split by multiple delimiters on the internet without using boost/crazy templates – router Nov 30 '22 at 06:43
24

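If you have Boost, you could use:

#include <boost/algorithm/string.hpp>
#include <string>
#include <vector>

std::string inputString("One!Two,Three:Four");
std::string delimiters("|,:");
std::vector<std::string> parts;
// Splits at every character that appears in `delimiters`.
boost::split(parts, inputString, boost::is_any_of(delimiters));
// parts: "One!Two", "Three", "Four" ('!' is not in the delimiter set)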
Matthew Smith
19

Using std::regex

A std::regex can do string splitting in a few lines:

std::regex re("[\\|,:]");
// The -1 selects the text between matches, which is what makes the iterator split.
std::sregex_token_iterator first{input.begin(), input.end(), re, -1}, last;
std::vector<std::string> tokens{first, last};
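
As a self-contained sketch of the same idea, the snippet could be wrapped as follows. The function name splitRegex is hypothetical, and dropping empty tokens is an added assumption: with submatch index -1, a delimiter at the very start of the input produces an empty first token.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits `input` wherever `pattern` matches and
// discards empty tokens.
std::vector<std::string> splitRegex(const std::string& input, const std::string& pattern)
{
    const std::regex re(pattern);
    // Submatch index -1 iterates over the text between matches.
    std::sregex_token_iterator first{input.begin(), input.end(), re, -1}, last;
    std::vector<std::string> tokens;
    for (; first != last; ++first)
        if (first->length() > 0)              // skip the empty token before a leading delimiter
            tokens.push_back(first->str());
    return tokens;
}
```

With a pattern such as "[.\\s]+", a run of dots and whitespace counts as one delimiter, so splitRegex("Windows. Apple", "[.\\s]+") gives just the two words.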

darune
  • 3
    How come this has only few up votes. This is absolutely brilliant ! Few lines, no need for external library and something novel. Thank you very much ! – Berkay Berabi Jul 08 '20 at 09:31
  • 2
    @berkayberabi no problem - it was a late answer, I think that's why. If you want to, you can post a bounty to reward an existing answer (that also draws attention). – darune Jul 08 '20 at 09:40
  • Hi, I don't have that much reputation. But I have one more question. If I also want to split on brackets ([]) as delimiters, how can I pass the brackets? If I just enter them, they are interpreted as another regular expression and it does not work. – Berkay Berabi Jul 09 '20 at 09:36
  • 1
    @berkayberabi escape via `\\]` – darune Jul 09 '20 at 11:08
  • How about this case? text = "Windows. Apple"; I only want to see ['Windows', 'Apple']. This regex gives [`Windows`, ' ', `Apple`], which contains a space (' ') I don't want. – cpchung May 08 '21 at 03:06
  • @cpchung it all depends on your input and what your aim is. In your case, matching one or more delimiters at once with something like [\\.\\s]+ should solve it. – darune Dec 26 '22 at 19:08
6

I don't know why nobody pointed out the manual way, but here it is:

const std::string delims(";,:. \n\t");
// True if c is one of the characters in delims.
inline bool isDelim(char c) {
    return delims.find(c) != std::string::npos;
}

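and in the function:

std::stringstream stringStream(inputString);
std::string word;
int c;  // int, not char: get() returns an int so that EOF differs from every valid character

while (stringStream) {
    word.clear();

    // Read word: stop at a delimiter or at end of input
    while ((c = stringStream.get()) != EOF && !isDelim(static_cast<char>(c)))
        word.push_back(static_cast<char>(c));
    if (c != EOF)
        stringStream.unget();

    // Input that starts with a delimiter (or is empty) yields no word here
    if (!word.empty())
        wordVector.push_back(word);

    // Read delims
    while ((c = stringStream.get()) != EOF && isDelim(static_cast<char>(c)));
    if (c != EOF)
        stringStream.unget();
}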

This way you can do something useful with the delims if you want.
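
As one illustration of doing something useful with the delimiters, the sketch below keeps each non-whitespace delimiter as its own token, which suits the arithmetic expressions mentioned in the question. The function name tokenizeKeepingDelims and the rule "drop whitespace, keep everything else" are assumptions made for this example only.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: splits `input` into words and keeps every
// non-whitespace delimiter as a separate one-character token.
std::vector<std::string> tokenizeKeepingDelims(const std::string& input,
                                               const std::string& delims)
{
    std::vector<std::string> tokens;
    std::string word;
    for (char c : input) {
        if (delims.find(c) == std::string::npos) {
            word.push_back(c);                // ordinary character: extend the current word
            continue;
        }
        if (!word.empty()) {                  // a delimiter ends the current word
            tokens.push_back(word);
            word.clear();
        }
        if (c != ' ' && c != '\t' && c != '\n')
            tokens.push_back(std::string(1, c));  // keep operators, drop whitespace
    }
    if (!word.empty())                        // last word, if the input does not end in a delimiter
        tokens.push_back(word);
    return tokens;
}
```

For example, tokenizeKeepingDelims("3+4 * (12-x)", "+-*/() ") separates the operands from the operators and parentheses while discarding the spaces.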

forumulator
  • 1
    You can move std::string word; and char c; inside the loop and avoid using clear()... variables should be as local and short-lived as possible. – Mohan Dec 04 '17 at 21:15
2

If you are interested in how to do it yourself without Boost:

Suppose the delimiter string may be long, say of length M. Checking whether a character is a delimiter by scanning that string costs O(M), so doing this for every character of an input string of length N costs O(M*N).

I would use a lookup table instead. A map from character to bool would work, but a plain boolean array that holds true at the index equal to each delimiter's character code is simpler.

Checking whether a character is a delimiter is then O(1), which gives O(N) overall (plus O(M) to fill the table).

Here is my sample code:

#include <iostream>
#include <string>
#include <vector>

using namespace std;

const int dictSize = 256;

vector<string> tokenizeMyString(const string &s, const string &del)
{
    // Lookup table indexed by character code. It is rebuilt on every call,
    // so delimiters from an earlier call do not leak into this one.
    bool dict[dictSize] = { false };

    vector<string> res;
    for (unsigned char c : del) {   // unsigned char keeps the index in 0..255
        dict[c] = true;
    }

    string token;
    for (unsigned char c : s) {
        if (dict[c]) {
            // A delimiter ends the current token; empty tokens are skipped
            if (!token.empty()) {
                res.push_back(token);
                token.clear();
            }
        }
        else {
            token += static_cast<char>(c);
        }
    }
    if (!token.empty()) {
        res.push_back(token);
    }
    return res;
}


int main()
{
    string delString = "MyDog:Odie, MyCat:Garfield  MyNumber:1001001";
    // the delimiters are " " (space) and "," (comma)
    vector<string> res = tokenizeMyString(delString, " ,");

    for (auto &i : res) {
        cout << "token: " << i << endl;
    }
    return 0;
}

Note: tokenizeMyString returns the vector by value. Because res is a local variable, the compiler can apply return value optimization (RVO), or at worst move the vector, so returning it does not copy the tokens.

Kohn1001
1

And here, ages later, a solution using C++20:

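constexpr std::string_view words{"Hello-_-C++-_-20-_-!"};
constexpr std::string_view delimiter{"-_-"};  // a single, multi-character delimiter
for (const auto word : std::views::split(words, delimiter)) {
    // Each piece is a subrange; build a string_view from its iterators.
    std::cout << std::quoted(std::string_view(word.begin(), word.end())) << ' ';
}
// outputs: "Hello" "C++" "20" "!"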

Required headers:

#include <iomanip>
#include <iostream>
#include <ranges>
#include <string_view>

Reference: https://en.cppreference.com/w/cpp/ranges/split_view
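
Note that std::views::split treats its pattern as one multi-character delimiter, not as a set of delimiters. To split on any of several single-character delimiters without copying, one possible sketch uses std::string_view::find_first_of instead of ranges. The name splitAnyOf is hypothetical, and the returned views point into the original text, so that text is assumed to outlive the result.

```cpp
#include <string_view>
#include <vector>

// Hypothetical helper: returns views into `text` (no copies are made),
// split on any character in `delims`; empty pieces are skipped.
std::vector<std::string_view> splitAnyOf(std::string_view text, std::string_view delims)
{
    std::vector<std::string_view> words;
    while (!text.empty()) {
        const std::size_t pos = text.find_first_of(delims);
        if (pos == std::string_view::npos) {  // no delimiter left: the rest is one word
            words.push_back(text);
            break;
        }
        if (pos > 0)                          // non-empty piece before the delimiter
            words.push_back(text.substr(0, pos));
        text.remove_prefix(pos + 1);          // drop the piece and the delimiter
    }
    return words;
}
```

For instance, splitAnyOf("Hello,C++;20,", ",;") splits at both the commas and the semicolon.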

  • Thanks! Anyone who wants to use this please check the reference link. The example is a bit different in the current docs. – Hitokage Nov 06 '22 at 14:47
  • 3
    This is incorrect. The OP was asking about using multiple delimiters. This answer uses the single, multi-character delimiter `-_-`. If you have the input `"Hello,C++;20,"` and use `std::views::split(words, ",;"_sv)`, it won't split anything, because `,;` does not appear in the input. – Chris Jan 19 '23 at 16:00
0

Using Eric Niebler's range-v3 library:

https://godbolt.org/z/ZnxfSa

#include <string>
#include <iostream>
#include "range/v3/all.hpp"

int main()
{
    std::string s = "user1:192.168.0.1|user2:192.168.0.2|user3:192.168.0.3";
    auto words = s  
        | ranges::view::split('|')
        | ranges::view::transform([](auto w){
            return w | ranges::view::split(':');
        });
    ranges::for_each(words, [](auto i){ std::cout << i << "\n"; });
}
Porsche9II