28

I investigated some time here on StackOverflow to find good algorithms to split strings with multiple delimiters into a vector< string >. I also found some methods:

The Boost way:

boost::split(vector, string, boost::is_any_of(" \t"));

the getline method:

std::stringstream ss(string);
std::string item;
while(std::getline(ss, item, ' ')) {
    vector.push_back(item);
}

the tokenize way of Boost:

char_separator<char> sep(" \t");
tokenizer<char_separator<char>> tokens(string, sep);
BOOST_FOREACH(string t, tokens)
{
   vector.push_back(t);
}

and the cool STL way:

     istringstream iss(string);
     copy(istream_iterator<string>(iss),
     istream_iterator<string>(),
     back_inserter<vector<string> >(vector));

and the method of Shadow2531 (see the linked topic).

Most of them came from this topic. But they unfortunately don't solve my problem:

  • Boost's split is easy to use but with the big data (about 1.5*10^6 single elements in best cases) and about 10 delimiters I am using it's horrific slow.

  • The getline, STL and Shadow2531's method have the problem that I can only use one single char as delimiter. I need a few more.

  • Boost's tokenize is even more horrific in the aspect of speed. It took 11 seconds with 10 delimiters to split a string into 1.5*10^6 elements.

So I don't know what to do: I want to have a really fast string splitting algorithm with multiple delimiters.

Is Boost's split the maximum or is there a way to do it faster?

Community
  • 1
  • 1
Paul
  • 1,295
  • 2
  • 11
  • 26
  • Have you tried multi threading? For instance split your delim into N groups, and for each group run a thread to split the string by that delim group and populate a list, then recombine the lists afterwards? – Nick Banks Mar 31 '11 at 20:56
  • There must be some faster way. This: `cat loremipsum_big.txt | ruby -e "ary = Array.new; ARGF.each {|x| ary << x.split(/a| /)}; puts ary" | wc -l` creates 2.516.715 elements, pushes them into an array in 3.74 seconds in a Virtualboxed Ubuntu. – karatedog Mar 31 '11 at 21:17
  • 5
    How about some of the examples from the following article: http://www.codeproject.com/KB/recipes/Tokenizer.aspx They are very efficient. The String Toolkit Library makes complex string processing in C++ simple and easy. –  Jul 30 '11 at 09:46

3 Answers3

34

Two things come to mind:

  1. Use string views instead of strings as the split result, saves a lot of allocations.
  2. If you know you're only going to be working with chars (in the [0,255] range), try using a bitset to test membership instead of finding into the delimiter characters.

Here's a quick attempt at applying these ideas:

#include <vector>
#include <bitset>
#include <iostream>
#include <boost/algorithm/string/split.hpp>
#include <boost/algorithm/string/classification.hpp>
#include <boost/timer.hpp>

using namespace std;
size_t const N = 10000000;

template<typename C>
void test_custom(string const& s, char const* d, C& ret)
{
  C output;

  bitset<255> delims;
  while( *d )
  {
    unsigned char code = *d++;
    delims[code] = true;
  }
  typedef string::const_iterator iter;
  iter beg;
  bool in_token = false;
  for( string::const_iterator it = s.begin(), end = s.end();
    it != end; ++it )
  {
    if( delims[*it] )
    {
      if( in_token )
      {
        output.push_back(typename C::value_type(beg, it));
        in_token = false;
      }
    }
    else if( !in_token )
    {
      beg = it;
      in_token = true;
    }
  }
  if( in_token )
    output.push_back(typename C::value_type(beg, s.end()));
  output.swap(ret);
}

template<typename C>
void test_strpbrk(string const& s, char const* delims, C& ret)
{
  C output;

  char const* p = s.c_str();
  char const* q = strpbrk(p+1, delims);
  for( ; q != NULL; q = strpbrk(p, delims) )
  {
    output.push_back(typename C::value_type(p, q));
    p = q + 1;
  }

  output.swap(ret);
}

template<typename C>
void test_boost(string const& s, char const* delims)
{
  C output;
  boost::split(output, s, boost::is_any_of(delims));
}

int main()
{
  // Generate random text
  string text(N, ' ');
  for( size_t i = 0; i != N; ++i )
    text[i] = (i % 2 == 0)?('a'+(i/2)%26):((i/2)%2?' ':'\t');

  char const* delims = " \t[],-'/\\!\"§$%&=()<>?";

  // Output strings
  boost::timer timer;
  test_boost<vector<string> >(text, delims);
  cout << "Time: " << timer.elapsed() << endl;

  // Output string views
  typedef string::const_iterator iter;
  typedef boost::iterator_range<iter> string_view;
  timer.restart();
  test_boost<vector<string_view> >(text, delims);
  cout << "Time: " << timer.elapsed() << endl;

  // Custom split
  timer.restart();
  vector<string> vs;
  test_custom(text, delims, vs);
  cout << "Time: " << timer.elapsed() << endl;

  // Custom split
  timer.restart();
  vector<string_view> vsv;
  test_custom(text, delims, vsv);
  cout << "Time: " << timer.elapsed() << endl;

  // Custom split
  timer.restart();
  vector<string> vsp;
  test_strpbrk(text, delims, vsp);
  cout << "Time: " << timer.elapsed() << endl;

  // Custom split
  timer.restart();
  vector<string_view> vsvp;
  test_strpbrk(text, delims, vsvp);
  cout << "Time: " << timer.elapsed() << endl;

  return 0;
}

Compiling this with Boost 1.46.1 using GCC 4.5.1 with the -O4 flag enabled I get:

  • Time: 5.951 (Boost.Split + vector)
  • Time: 3.728 (Boost.Split + vector
  • Time: 1.662 (Custom split + vector)
  • Time: 0.144 (Custom split + vector)
  • Time: 2.13 (Strpbrk + vector)
  • Time: 0.527 (Strpbrk + vector)

NOTE: There's a slight difference in the output as empty tokens are dropped by my custom function. But you can adapt this code to your needs if you decide to use it.

Pablo
  • 8,644
  • 2
  • 39
  • 29
  • Hi Pablo, thanks for your answer! I tried to adapt your solution to my code but I failed using it properly: 1. problem: I don't know how to get the test_custom function to return C - I also don't know where to add the 'return'. 2. problem: vector seems to be better but I am assigning the strings so I need a vector or can I use the const_iterators of the strings? | I've pasted a simplified version of my code here: http://pastebin.com/0k6wNtFe – Paul Apr 01 '11 at 13:41
  • See my edit. It now takes an output parameter to assign it the output tokens. As to the string views, they should remain valid as long as the original string is valid. If you're going to be working with them using the Boost string algorithms, the views are considered valid strings, only they don't own their contents. I don't know what you mean by "assigning the strings", so if you could explain that a bit, you could get some suggestions on how to proceed. – Pablo Apr 01 '11 at 18:37
  • Thanks for editing your method. I adapted it and it kinda works but the whole function is making me trouble. strpbrk makes more sense for me to use as finding delimiters but somehow this messes things up. Also in this state it's nearly as fast/slow as boost's split method with vector. http://pastebin.com/WU9wfEqg – Paul Apr 01 '11 at 21:43
  • Ah, I almost forgot: Mostly this problem comes from the function strpbrk because it's not able to detect white spaces. How can I do that with that function? – Paul Apr 01 '11 at 21:45
  • Well, the problem is that using `strpbrk` inside the loop that way is making the function iterate over the string's contents far more times than are needed. The original funcion I showed you checked each character in the string once. Using `strpbrk` if a word has N characters, instead of checking N times, you're checking N*(N+1)/2. – Pablo Apr 01 '11 at 22:33
  • That made the difference - now it's running faster again. But I have a (hopefully the last) problem somewhere else: In your example you generate a 'valid' string that works with the delims[*it] method. I get my data from a string filled by reading an ifstream (http://pastebin.com/berTPPTD). If I try to use that string it fails with different errors (Seg Fault, Illegal instruction ...). How can I get it right? I also don't understand how bitset can influence if things can get out of range? Does N set the bit count or the size of the bitset? – Paul Apr 02 '11 at 12:05
  • Ok, see my final edit. I added an `strpbrk` based split. It is definitely faster than using `boost::split`, but still not as fast as the custom splitter. The reason for the segfault was that one of the delimiters you're using was above the [0-127] char range, and using it as an index into the bitset turned it into a negative value. That is fixed in this version. By the way, I added spaces and tabs to your delimiter list and `strpbrk` has no problem with them. – Pablo Apr 03 '11 at 19:11
  • Thanks, Pablo. If I could I would double the points you get for this answer but unfortunately I can't. :) – Paul Apr 04 '11 at 14:57
  • Would it kill performance to convert string_view back to a vector? How can one do that? –  Jun 30 '14 at 17:34
2

To combine the best parts of Pablo's and larsmans's answers, use a (offset, size) pair to store substrings and strcspn to get the extents of every entry.

MSN
  • 53,214
  • 7
  • 75
  • 105
1

On such large strings, it may pay off to use ropes instead. Or use string views as Pablo recommends: (char const*, size_t) pairs. The bitset trick is not necessary if you have a good implementation of strpbrk.

Fred Foo
  • 355,277
  • 75
  • 744
  • 836