
I'm asking this question because I moved my tokenizer from strtok_r to an equivalent version in C++. I have to use strtok_r instead of strtok because most of the time I have two nested tokenizations to perform.

The strtok_r algorithm is something like this:

char *end_token, *token, *word ;
// fill 'word'
token = strtok_r (word, " ", &end_token) ;
while (token != NULL) {
  // do something
  token = strtok_r (NULL, " ", &end_token) ;
}
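
For clarity, here is a minimal, self-contained sketch of what I mean by two nested tokenizations (the input string and the ';' delimiter are placeholders of my own; strtok_r is POSIX, not standard):

#include <string.h>
#include <stdio.h>

int main (void) {
    // placeholder input: records separated by ';', words separated by ' '
    char word[] = "one two;three four" ;
    char *end_record, *end_token, *record, *token ;
    record = strtok_r (word, ";", &end_record) ;
    while (record != NULL) {
        // the inner loop needs its own save pointer, which is why
        // strtok_r is required instead of strtok
        token = strtok_r (record, " ", &end_token) ;
        while (token != NULL) {
            printf ("%s\n", token) ;
            token = strtok_r (NULL, " ", &end_token) ;
        }
        record = strtok_r (NULL, ";", &end_record) ;
    }
    return 0 ;
}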

And the C++ version is like this (taken from another post here):

string mystring, token ;
size_t next_token ;
// fill 'mystring'
while (token != mystring) {
    next_token = mystring.find_first_of (" ") ;
    token = mystring.substr (0, next_token) ;
    mystring = mystring.substr (next_token + 1) ;
    // do something
}

Now the question: why is the C++ version so heavy compared to the C version? For long strings I have to wait about 10 seconds with the C++ version, while the C version is instantaneous on the same strings. So it seems the C++ version has higher complexity... What do you think?

nicolati
  • `strtok` modifies the input string, while the C++ version is making copies. While they may have the same end result, the two versions are taking vastly different routes to reach the end. – Chad Aug 19 '15 at 17:45
  • 1
  • You might want to see this on how to split a string in C++, as it is more efficient: http://stackoverflow.com/questions/236129/split-a-string-in-c What does your data look like, and what are you trying to do with it? – NathanOliver Aug 19 '15 at 17:47
  • If you are testing on constant strings, functions with equivalent __builtin_'s will always win because they get folded. – technosaurus Aug 19 '15 at 17:55
  • Maybe because the C library was written at a time when resources were scarce and efficiency was preferred over neat design, while C++ is more concerned with robustness and design - but this is just my opinion :-) – Serge Ballesta Aug 19 '15 at 17:57

1 Answer


strtok() modifies the string, replacing each token delimiter with a null terminator. If your long string has n tokens, the function just iterates through the string, changing n chars to '\0', which is extremely fast.
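
To make this concrete, here is a tiny sketch of that in-place behaviour (the buffer content is an arbitrary example; strtok_r assumes a POSIX platform):

#include <string.h>
#include <stdio.h>

int main (void) {
    char buf[] = "alpha beta gamma" ;
    char *save ;
    char *token = strtok_r (buf, " ", &save) ;  // writes '\0' over the first space
    printf ("%s\n", token) ;  // prints "alpha"
    printf ("%s\n", buf) ;    // also "alpha": buf itself was cut at the inserted '\0'
    return 0 ;
}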

In your C++ alternative you are making 2*n string copies, which means potentially 2*n allocation operations, and each iteration copies the (very long) remaining string, so the total work grows quadratically with the string's length. That is much heavier than the first alternative; the upside is that you are not obliged to change the original string.

You could improve this by keeping the string you are iterating through unchanged, and for example use offsets for the search:

string mystring, token ;
size_t cur_token=0, next_token ;
// fill 'mystring'
do {
    next_token = mystring.find_first_of (" ", cur_token) ;
    token = mystring.substr (cur_token, next_token-cur_token);  // next_token-(next_token==string::npos ? 0 : cur_token) would be cleaner
    if (next_token!=string::npos) 
        cur_token = next_token+1; 
    // do something with token;
} while (next_token!=string::npos);
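
For completeness, here is a self-contained version of that loop (the input string and the output statement are placeholders of mine):

#include <iostream>
#include <string>
using namespace std ;

int main () {
    string mystring = "this is a sample input" ;  // placeholder input
    string token ;
    size_t cur_token = 0, next_token ;
    do {
        next_token = mystring.find_first_of (" ", cur_token) ;
        token = mystring.substr (cur_token, next_token-cur_token) ;
        if (next_token != string::npos)
            cur_token = next_token + 1 ;
        cout << token << "\n" ;  // do something with token
    } while (next_token != string::npos) ;
    return 0 ;
}

Only the token itself is copied on each pass; the long remainder of the string is never duplicated.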


Christophe
  • Thank you! I will try it. It seems very interesting! I didn't realize that "my" algorithm had to perform such a big number of copies and allocations... :( – nicolati Aug 20 '15 at 18:55
  • 1
  • Hi Christophe, I tried your algorithm. It's much better than "mine", but still worse than strtok. Since it's about 5X faster than "mine" and maybe 2X slower than strtok, I will use it anyway. I can survive with 2X :) Thank you – nicolati Aug 21 '15 at 18:09