83

What would be the easiest way to split a string using C++11?

I've seen the method used by this post, but I feel that there ought to be a less verbose way of doing it using the new standard.

Edit: I would like to get a vector<string> as the result and be able to split on a single-character delimiter.

Community
Mark
  • 1
    Splitting on spaces? And I don't think C++11 added anything here, think [the accepted answer](http://stackoverflow.com/a/237280/845092) is still the best way. – Mooing Duck Feb 24 '12 at 17:42
  • What do you want to do after you split? Print to cout, or get a vector of substrings? – balki Feb 24 '12 at 17:42
  • Isn't this what Regular Expression parsing is for? – Nicol Bolas Feb 24 '12 at 17:46
  • 1
    I think the [most voted answer](http://stackoverflow.com/a/236803/612920) is the best – Mansuro Feb 09 '17 at 10:40
  • 1
    Possible duplicate of [The most elegant way to iterate the words of a string](https://stackoverflow.com/questions/236129/the-most-elegant-way-to-iterate-the-words-of-a-string) – underscore_d Oct 07 '18 at 20:50
  • With c++17 you have [string_view](https://en.cppreference.com/w/cpp/header/string_view) which would make the string copies redundant (with caveats) or if you don't mind adding a library, you could use abseil https://abseil.io/tips/10 . I'd make this answer but the question specifically asks for c++11. – Paul Rooney Sep 22 '20 at 23:45

11 Answers

67

std::regex_token_iterator performs generic tokenization based on a regex. It may or may not be overkill for doing simple splitting on a single character, but it works and is not too verbose:

#include <regex>
#include <string>
#include <vector>

std::vector<std::string> split(const std::string& input, const std::string& regex) {
    // passing -1 as the submatch index parameter performs splitting
    std::regex re(regex);
    std::sregex_token_iterator
        first{input.begin(), input.end(), re, -1},
        last;
    return {first, last};
}
MM.
JohannesD
  • 3
    Should mention that this is MSFT-specific. Doesn't exist on POSIX systems. – jackyalcine Sep 06 '14 at 20:11
  • Looks like it is also available in [boost.](http://www.boost.org/doc/libs/1_56_0/libs/regex/doc/html/boost_regex/ref/regex_token_iterator.html) – phs Sep 06 '14 at 20:57
  • 16
    [`regex_token_iterator`](http://en.cppreference.com/w/cpp/regex/regex_token_iterator) is defined in C++11, but GCC doesn't support it natively until version 4.9 (see [here](http://stackoverflow.com/a/8913707/86967na)). With earlier versions of GCC, you can use [Boost regex](http://www.boost.org/doc/libs/release/libs/regex/). – Brent Bradburn Dec 05 '14 at 22:30
  • A good regex initializer would be `" +"`, for "one or more spaces". – Brent Bradburn Dec 05 '14 at 23:30
  • 2
    A good'er regex would be `\\s+` for whitespace. Also, on gcc 4.9 I have to explicitly initialize a regex with the string parameter, before passing it to the iterator constructor. Just add `regex re{regex_str};` as a first line, where `regex_str` is the string called `regex` in the example, then pass `re`. – Alfred Bratterud Feb 11 '15 at 19:14
  • You need gcc 4.9 for this. – lppier May 23 '17 at 07:01
  • This works great, but even after reading the docs - I don't get the syntax of the line starting `std::sregex_token_iterator...`. Is this two iterators called first and last? and why is `last` not "set" to any value - I assume there is some sort of default...? – code_fodder Sep 03 '18 at 16:45
  • Oh wait - I got the last bit now after re-reading... it defaults to `end of sequence`. So I am assuming this could be re-written: `std::sregex_token_iterator first{input.begin(), input.end(), re, -1};` and `std::sregex_token_iterator last;`...? – code_fodder Sep 03 '18 at 16:47
  • @code_fodder Yes, a(ny) default-constructed instance functions as an end iterator here. This is also the case with stream iterators and other cases where there's no definitive "end" known beforehand. – JohannesD Sep 04 '18 at 17:16
  • A great solution, but be aware that the second parameter (the delimiter) is treated as a regular expression. If your delimiter is a regex special character such as "|", you can't pass "|" directly; you need to escape it as "\\|". – Tony_Tong Sep 21 '18 at 14:53
  • Good solution. However, if performance is a concern, use boost::regex or something else. – heLomaN Nov 07 '18 at 14:09
  • Sorry, way too complicated. – Niclas Apr 05 '21 at 09:34
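The escaping pitfall raised in the comments above can be sidestepped by escaping the delimiter before the regex is built. The following is only a sketch: `escape_regex` and `split_literal` are hypothetical names, not part of the answer, and the list of special characters assumes the default ECMAScript grammar used by `std::regex`.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the answer): backslash-escape every character
// that has a special meaning in the default ECMAScript regex grammar.
std::string escape_regex(const std::string& literal) {
    static const std::string special = R"(\^$.|?*+()[]{})";
    std::string escaped;
    for (char c : literal) {
        if (special.find(c) != std::string::npos)
            escaped += '\\';
        escaped += c;
    }
    return escaped;
}

// Split on a literal delimiter string by escaping it before compiling the regex.
std::vector<std::string> split_literal(const std::string& input,
                                       const std::string& delimiter) {
    std::regex re(escape_regex(delimiter));
    std::sregex_token_iterator first{input.begin(), input.end(), re, -1}, last;
    return {first, last};
}
```

With this wrapper, `split_literal("a|b|c", "|")` yields the three tokens `a`, `b` and `c` without the caller having to write `"\\|"`.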
38

Here is a (maybe less verbose) way to split string (based on the post you mentioned).

#include <string>
#include <sstream>
#include <vector>
std::vector<std::string> split(const std::string &s, char delim) {
  std::stringstream ss(s);
  std::string item;
  std::vector<std::string> elems;
  while (std::getline(ss, item, delim)) {
    elems.push_back(item);
    // elems.push_back(std::move(item)); // if C++11 (based on comment from @mchiasson)
  }
  return elems;
}
Community
Yaguang
  • 12
    If you are using C++11, you could also do this to avoid string copies when inserting into your vector: elems.push_back(std::move(item)); – mchiasson Feb 15 '15 at 16:03
  • Firstly, `std::move` doesn't move anything; it just casts the type. `item` is defined on the stack, so it will be copied to the vector's heap storage. So using `std::move` here won't avoid the copy. – Jack Zhang Dec 01 '20 at 09:14
  • 3
    Even though `item` is defined on the stack, the internal data pointer points to data allocated on the heap. By using `std::move`, the `push_back(std::string&&)` overload will be selected, causing the `std::string` object inside the vector to be initialized by move -- simply copying the data pointer, rather than copying the entire buffer. – Tyg13 Apr 26 '21 at 02:22
  • 1
    Is the compiler not capable of doing this `std::move` optimization automatically? – dshin Oct 27 '22 at 21:54
  • If the string ends in a delimiter (e.g., an empty csv column at the end of the line), it does not return the empty string. It simply returns one fewer string. For example: 1,2,3,4\nA,B,C, – Paul Nov 12 '22 at 17:44
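The last comment above points out that the `getline`-based version drops a trailing empty field. If that field matters (as in the CSV case mentioned), one possible alternative is to search with `std::string::find` instead. This is a sketch under that assumption, and `split_keep_empty` is a hypothetical name rather than anything from the answer.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical variant: every delimiter produces a field boundary, so a
// trailing delimiter yields a final empty string instead of being dropped.
std::vector<std::string> split_keep_empty(const std::string& s, char delim) {
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    for (;;) {
        const std::string::size_type pos = s.find(delim, start);
        if (pos == std::string::npos) {
            fields.push_back(s.substr(start));  // last field, possibly empty
            break;
        }
        fields.push_back(s.substr(start, pos - start));
        start = pos + 1;
    }
    return fields;
}
```

For the input `A,B,C,` this returns four fields, the last one empty. Note that an empty input returns a single empty field, which may or may not be what a given caller wants.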
22

Here's an example of splitting a string and populating a vector with the extracted elements using boost.

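#include <boost/algorithm/string.hpp>
#include <cassert>
#include <string>
#include <vector>

int main() {
    std::string my_input("A,B,EE");
    std::vector<std::string> results;

    // Split on any character in the set passed to is_any_of (here just ',').
    boost::algorithm::split(results, my_input, boost::is_any_of(","));

    assert(results[0] == "A");
    assert(results[1] == "B");
    assert(results[2] == "EE");
}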
rezaebrh
fduff
19

Another regex solution inspired by other answers but hopefully shorter and easier to read:

std::string s{"String to split here, and here, and here,..."};
std::regex regex{R"([\s,]+)"}; // split on space and comma
std::sregex_token_iterator it{s.begin(), s.end(), regex, -1};
std::vector<std::string> words{it, {}};
Delgan
wally
  • Nice answer. Where is this syntax: `words{it, {}};` described for initializing a vector? – Gardener Mar 28 '19 at 10:20
  • 3
    Found an answer here: [empty curly braces as end of range](https://stackoverflow.com/questions/30124122/empty-curly-bracket-as-end-of-range) – Gardener Mar 28 '19 at 13:39
  • 2
    In case anyone else is wondering: the `-1` argument of the `sregex_token_iterator` constructor causes the object to iterate over the fragments _between_ matches. The default value of `0` would iterate over fragments matching the regex. See [here](https://en.cppreference.com/w/cpp/regex/regex_token_iterator) for more details. – xperroni Dec 01 '20 at 16:29
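To make the difference between the `-1` and `0` submatch arguments concrete, here is a small sketch. The helper name `tokens` is made up for illustration; the regex is the same `[\s,]+` pattern used in the answer.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Collect the tokens produced for a given submatch index:
//   -1 -> the text between matches, 0 -> the matched text itself.
std::vector<std::string> tokens(const std::string& s, const std::regex& re, int submatch) {
    std::sregex_token_iterator it{s.begin(), s.end(), re, submatch};
    std::sregex_token_iterator end;  // default-constructed == end of sequence
    return {it, end};
}
```

For the input `a, b,c`, passing `-1` gives the words `a`, `b`, `c`, while passing `0` gives the separators that were matched, `", "` and `","`.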
6

I don't know if this is less verbose, but it might be easier to grok for those more seasoned in dynamic languages such as JavaScript. The only C++11 features it uses are auto and the range-based for loop.

#include <string>
#include <cctype>
#include <iostream>
#include <vector>

using namespace std;

int main()
{
  string s = "hello  how    are you won't you tell me your name";
  vector<string> tokens;
  string token;

  for (const auto& c: s) {
    if (!isspace(static_cast<unsigned char>(c)))  // cast avoids undefined behaviour for negative char values
      token += c;
    else {
      if (token.length()) tokens.push_back(token);
      token.clear();
    }
  }

  if (token.length()) tokens.push_back(token);
     
  return 0;
}
dmitry_romanov
Faisal Vali
4
#include <iostream>
#include <algorithm>
#include <cctype>   // isspace, used as the default delimiter predicate
#include <vector>
#include <string>

using namespace std;

vector<string> split(const string& str, int delimiter(int) = ::isspace){
  vector<string> result;
  auto e=str.end();
  auto i=str.begin();
  while(i!=e){
    i=find_if_not(i,e, delimiter);
    if(i==e) break;
    auto j=find_if(i,e, delimiter);
    result.push_back(string(i,j));
    i=j;
  }
  return result;
}

int main(){
  string line;
  getline(cin,line);
  vector<string> result = split(line);
  for(auto s: result){
    cout<<s<<endl;
  }
}
chekkal
  • Why `int` as delimiter, and why `int delimiter(int)` the `(int)`? – Ela782 Aug 22 '17 at 22:22
  • 2
    @Ela782 it's a function pointer argument, a function that accepts an int parameter and returns int. The default is the isspace function. – Fsmv Sep 04 '17 at 21:40
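As the reply above explains, the second parameter is a pointer to a function taking and returning `int`, so any function with that shape can be passed in place of `isspace`. The sketch below spells the pointer type out explicitly; `split_if` and `is_comma_or_semicolon` are hypothetical names chosen for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Same shape as the answer's split(), with the parameter written as an
// explicit function pointer: int (*)(int).
std::vector<std::string> split_if(const std::string& str, int (*is_delimiter)(int)) {
    std::vector<std::string> result;
    auto i = str.begin();
    const auto e = str.end();
    while (i != e) {
        i = std::find_if_not(i, e, is_delimiter);   // skip a run of delimiters
        if (i == e) break;
        auto j = std::find_if(i, e, is_delimiter);  // find the end of the token
        result.emplace_back(i, j);
        i = j;
    }
    return result;
}

// Hypothetical predicate: treat commas and semicolons as delimiters.
int is_comma_or_semicolon(int c) { return c == ',' || c == ';'; }
```

Calling `split_if("a,b;;c", is_comma_or_semicolon)` returns `a`, `b`, `c`; runs of delimiters are collapsed, just as runs of whitespace are with the default predicate.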
4

My choice is boost::tokenizer, although I haven't used it for heavy tasks or tested it with large inputs. Here is the example from the Boost documentation, modified to use a lambda:

#include <algorithm>   // for_each
#include <iostream>
#include <boost/tokenizer.hpp>
#include <string>
#include <vector>

int main()
{
   using namespace std;
   using namespace boost;

   string s = "This is,  a test";
   vector<string> v;
   tokenizer<> tok(s);
   for_each (tok.begin(), tok.end(), [&v](const string & s) { v.push_back(s); } );
   // result 4 items: 1)This 2)is 3)a 4)test
   return 0;
}
fduff
Torsten
3
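#include <string>
#include <vector>
#include <sstream>

// Splits on whitespace: operator>> skips leading whitespace and reads one word at a time.
inline std::vector<std::string> split(const std::string& s) {
    std::vector<std::string> result;
    std::istringstream iss(s);
    for (std::string w; iss >> w; )
        result.push_back(w);
    return result;
}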
Bimo
  • 1
    I think this post : https://stackoverflow.com/questions/11719538/how-to-use-stringstream-to-separate-comma-separated-strings does it better using istringstream – eminemence Aug 01 '18 at 14:39
  • After renaming the second `s` to say `w` it worked nicely. Please update the answer so that it compiles everywhere. – Carlos Pinzón May 13 '19 at 22:31
  • I think for performance you can write: `result.emplace_back(std::move(w));` – Кое Кто Jun 23 '23 at 16:34
2

This is my answer. Verbose, readable and efficient.

std::vector<std::string> tokenize(const std::string& s, char c) {
    auto end = s.cend();
    auto start = end;

    std::vector<std::string> v;
    for( auto it = s.cbegin(); it != end; ++it ) {
        if( *it != c ) {
            if( start == end )
                start = it;
            continue;
        }
        if( start != end ) {
            v.emplace_back(start, it);
            start = end;
        }
    }
    if( start != end )
        v.emplace_back(start, end);
    return v;
}
ymmt2005
2

Here is a C++11 solution that uses only std::string::find(). The delimiter can be any number of characters long. Parsed tokens are output via an output iterator, which is typically a std::back_inserter in my code.

I have not tested this with UTF-8, but I expect it should work as long as the input and delimiter are both valid UTF-8 strings.

#include <string>

template<class Iter>
Iter splitStrings(const std::string &s, const std::string &delim, Iter out)
{
    if (delim.empty()) {
        *out++ = s;
        return out;
    }
    std::size_t a = 0, b = s.find(delim);
    for ( ; b != std::string::npos;
          a = b + delim.length(), b = s.find(delim, a))
    {
        // substr() already returns a temporary, so wrapping it in std::move is redundant
        *out++ = s.substr(a, b - a);
    }
    *out++ = s.substr(a, s.length() - a);
    return out;
}

Some test cases:

#include <iostream>
#include <iterator>
#include <vector>

void test()
{
    std::vector<std::string> out;
    size_t counter;

    std::cout << "Empty input:" << std::endl;        
    out.clear();
    splitStrings("", ",", std::back_inserter(out));
    counter = 0;        
    for (auto i = out.begin(); i != out.end(); ++i, ++counter) {
        std::cout << counter << ": " << *i << std::endl;
    }

    std::cout << "Non-empty input, empty delimiter:" << std::endl;        
    out.clear();
    splitStrings("Hello, world!", "", std::back_inserter(out));
    counter = 0;        
    for (auto i = out.begin(); i != out.end(); ++i, ++counter) {
        std::cout << counter << ": " << *i << std::endl;
    }

    std::cout << "Non-empty input, non-empty delimiter"
                 ", no delimiter in string:" << std::endl;        
    out.clear();
    splitStrings("abxycdxyxydefxya", "xyz", std::back_inserter(out));
    counter = 0;        
    for (auto i = out.begin(); i != out.end(); ++i, ++counter) {
        std::cout << counter << ": " << *i << std::endl;
    }

    std::cout << "Non-empty input, non-empty delimiter"
                 ", delimiter exists string:" << std::endl;        
    out.clear();
    splitStrings("abxycdxy!!xydefxya", "xy", std::back_inserter(out));
    counter = 0;        
    for (auto i = out.begin(); i != out.end(); ++i, ++counter) {
        std::cout << counter << ": " << *i << std::endl;
    }

    std::cout << "Non-empty input, non-empty delimiter"
                 ", delimiter exists string"
                 ", input contains blank token:" << std::endl;        
    out.clear();
    splitStrings("abxycdxyxydefxya", "xy", std::back_inserter(out));
    counter = 0;        
    for (auto i = out.begin(); i != out.end(); ++i, ++counter) {
        std::cout << counter << ": " << *i << std::endl;
    }

    std::cout << "Non-empty input, non-empty delimiter"
                 ", delimiter exists string"
                 ", nothing after last delimiter:" << std::endl;        
    out.clear();
    splitStrings("abxycdxyxydefxy", "xy", std::back_inserter(out));
    counter = 0;        
    for (auto i = out.begin(); i != out.end(); ++i, ++counter) {
        std::cout << counter << ": " << *i << std::endl;
    }

    std::cout << "Non-empty input, non-empty delimiter"
                 ", only delimiter exists string:" << std::endl;        
    out.clear();
    splitStrings("xy", "xy", std::back_inserter(out));
    counter = 0;        
    for (auto i = out.begin(); i != out.end(); ++i, ++counter) {
        std::cout << counter << ": " << *i << std::endl;
    }
}

Expected output:

Empty input:
0: 
Non-empty input, empty delimiter:
0: Hello, world!
Non-empty input, non-empty delimiter, no delimiter in string:
0: abxycdxyxydefxya
Non-empty input, non-empty delimiter, delimiter exists string:
0: ab
1: cd
2: !!
3: def
4: a
Non-empty input, non-empty delimiter, delimiter exists string, input contains blank token:
0: ab
1: cd
2: 
3: def
4: a
Non-empty input, non-empty delimiter, delimiter exists string, nothing after last delimiter:
0: ab
1: cd
2: 
3: def
4: 
Non-empty input, non-empty delimiter, only delimiter exists string:
0: 
1: 
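The UTF-8 expectation stated in this answer can be checked with a multi-byte delimiter. The sketch below is not a thorough test: it includes a condensed copy of `splitStrings` so that it compiles on its own, and `splitOnArrow` is a hypothetical wrapper added purely for illustration. It relies on the fact that, in valid UTF-8, the byte sequence of one character cannot match starting in the middle of another character, so a byte-wise `find` is enough. It does not address Unicode normalization.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

// Condensed copy of splitStrings from this answer, so the block is self-contained.
template <class Iter>
Iter splitStrings(const std::string& s, const std::string& delim, Iter out) {
    if (delim.empty()) {
        *out++ = s;
        return out;
    }
    std::size_t a = 0, b = s.find(delim);
    for (; b != std::string::npos; a = b + delim.length(), b = s.find(delim, a)) {
        *out++ = s.substr(a, b - a);
    }
    *out++ = s.substr(a);
    return out;
}

// Hypothetical wrapper: split on U+2192 (RIGHTWARDS ARROW), spelled as raw
// UTF-8 bytes so the result does not depend on the source file's encoding.
std::vector<std::string> splitOnArrow(const std::string& s) {
    const std::string arrow = "\xE2\x86\x92";
    std::vector<std::string> out;
    splitStrings(s, arrow, std::back_inserter(out));
    return out;
}
```

Splitting a string made of a UTF-8 encoded word, the arrow, and an ASCII word returns exactly those two words, with the multi-byte characters of the first word left intact.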
villains
0

One possible way of doing this is to find all occurrences of the split string and store their positions in a list. Then walk through the input string by position; whenever you reach a position that appears in the hit list, jump forward by the length of the split string. This approach works with a split string of any non-zero length. Here is my tested and working solution.

#include <iostream>
#include <string>
#include <list>
#include <vector>

using namespace std;

vector<string> Split(string input_string, string search_string)
{
    list<int> search_hit_list;
    vector<string> word_list;
    size_t search_position, search_start = 0;

    // Find start positions of every substring occurrence and store positions to a hit list.
    while ( (search_position = input_string.find(search_string, search_start) ) != string::npos) {
        search_hit_list.push_back(search_position);
        search_start = search_position + search_string.size();
    }

    // Iterate through hit list and reconstruct substring start and length positions
    int character_counter = 0;
    int start, length;

    for (auto hit_position : search_hit_list) {

        // Skip over substrings we are splitting with. This also skips over repeating substrings.
        if (character_counter == hit_position) {
            character_counter = character_counter + search_string.size();
            continue;
        }

        start = character_counter;
        character_counter = hit_position;
        length = character_counter - start;
        word_list.push_back(input_string.substr(start, length));
        character_counter = character_counter + search_string.size();
    }

    // If the search string is not found in the input string, then return the whole input_string.
    // (Test the hit list, not word_list: word_list is also empty when the input merely starts with the search string.)
    if (search_hit_list.empty()) {
            word_list.push_back(input_string);
            return word_list;
    }
    // The last substring might still be unprocessed, get it.
    if (character_counter < input_string.size()) {
        word_list.push_back(input_string.substr(character_counter, input_string.size() - character_counter));
    }

    return word_list;
}

int main() {

    vector<string> word_list;
    string search_string = " ";
    // search_string = "the";
    string text = "thetheThis is  some   text     to test  with the    split-thethe   function.";

    word_list = Split(text, search_string);

    for (auto item : word_list) {
        cout << "'" << item << "'" << endl;
    }

    cout << endl;
}
mhartzel