Most efficient way to iterate over words in a string

Question

If I wanted to iterate over individual words in a string (separated by whitespace), then the obvious solution would be:

std::istringstream s(myString);

std::string word;
while (s >> word)
    do things

However that's quite inefficient. The entire string is copied while initializing the string stream, and then each extracted word is copied one at a time into the word variable (which is close to copying the entire string for a second time). Is there a way to improve on this without manually iterating over each character?

Do you need to full string for anything? If not you could just read it in as words from the get go. — NathanOliver, Feb 22 '17 at 17:18
What's wrong with "manually iterating over each character"? That's what `istringstream ::operator>>` probably does anyway (on top of copying the result into the `word` argument). — barak manos, Feb 22 '17 at 17:20
@Slava I agree that was poor phrasing on my part. I suppose I meant "more efficient than this way" — user1000039, Feb 22 '17 at 17:21
The first thing you should worry about is not "inefficiencies", but that the code is good, readable, maintainable and works. Then if (but only if) the program is slower than some requirement, you measure and benchmark and profile to find the hotspots and bottlenecks and start working there. — Some programmer dude, Feb 22 '17 at 17:22

score 5 · Accepted Answer · answered Feb 22 '17 at 17:29

In most cases, copying represents a very small percentage of the overall costs, so having a clean, highly readable code becomes more important. In rare cases when the time profiler tells you that copying creates a bottleneck, you can iterate over characters in the string with some help from the standard library.

One approach that you could take is to iterate with std::string::find_first_of and std::string::find_first_not_of member functions, like this:

const std::string s = "quick \t\t brown \t fox jumps over the\nlazy dog";
const std::string ws = " \t\r\n";
std::size_t pos = 0;
while (pos != s.size()) {
    std::size_t from = s.find_first_not_of(ws, pos);
    if (from == std::string::npos) {
        break;
    }
    std::size_t to = s.find_first_of(ws, from+1);
    if (to == std::string::npos) {
        to = s.size();
    }
    // If you want an individual word, copy it with substr.
    // The code below simply prints it character-by-character:
    std::cout << "'";
    for (std::size_t i = from ; i != to ; i++) {
        std::cout << s[i];
    }
    std::cout << "'" << std::endl;
    pos = to;
}

Demo.

Unfortunately, the code becomes a lot harder to read, so you should avoid this change, or at least postpone it until it becomes requried.

I had a feeling that there may have been a simple method hidden somewhere in the localization part of STL, but if not then string streams are definitely the way to go. Thanks for the demo though — user1000039, Feb 22 '17 at 17:53
This can be done much easier (and propably as efficient) using boost string algorithms. See my answer. — zett42, Feb 23 '17 at 00:40

zett42 · Answer 2 · 2017-02-23T00:29:39.743

Using boost string algorithms we can write it as follows. The loop doesn't involve any copying of the string.

#include <string>
#include <iostream>
#include <boost/algorithm/string.hpp>

int main()
{
    std::string s = "stack over   flow";

    auto it = boost::make_split_iterator( s, boost::token_finder( 
                          boost::is_any_of( " " ), boost::algorithm::token_compress_on ) );
    decltype( it ) end;

    for( ; it != end; ++it ) 
    {
        std::cout << "word: '" << *it << "'\n";
    }

    return 0;
}

Making it C++11-ish

Since pairs of iterators are so oldschool nowadays, we may use boost.range to define some generic helper functions. These finally allow us to loop over the words using range-for:

#include <string>
#include <iostream>
#include <boost/algorithm/string.hpp>
#include <boost/range/iterator_range_core.hpp>

template< typename Range >
using SplitRange = boost::iterator_range< boost::split_iterator< typename Range::const_iterator > >;

template< typename Range, typename Finder >
SplitRange< Range > make_split_range( const Range& rng, const Finder& finder )
{
    auto first = boost::make_split_iterator( rng, finder );
    decltype( first ) last;
    return {  first, last };
}

template< typename Range, typename Predicate >
SplitRange< Range > make_token_range( const Range& rng, const Predicate& pred )
{
    return make_split_range( rng, boost::token_finder( pred, boost::algorithm::token_compress_on ) );
}

int main()
{
    std::string str = "stack \tover\r\n  flow";

    for( const auto& substr : make_token_range( str, boost::is_any_of( " \t\r\n" ) ) )
    {
        std::cout << "word: '" << substr << "'\n";
    }

    return 0;
}

Demo:

http://coliru.stacked-crooked.com/a/2f4b3d34086cc6ec

score 0 · Answer 3 · answered Feb 23 '17 at 00:57

0

If you want to have it as fast as possible, you need to fall back to the good old C function strtok() (or its thread-safe companion strtok_r()):

const char* kWhiteSpace = " \t\v\n\r";    //whatever you call white space

char* token = std::strtok(myString.data(), kWhiteSpace);
while(token) {
    //do things with token
    token = std::strtok(nullptr, kWhiteSpace));
}

Beware that this will clobber the contents of myString: It works by replacing the first delimiter character after each token with a terminating null byte, and returning a pointer to the start of the tokens in turn. This is a legacy C function after all.

However, that weakness is also its strength: It does not perform any copy, nor does it allocate any dynamic memory (which likely is the most time consuming thing in your example code). As such, you won't find a native C++ method that beats strtok()'s speed.

answered Feb 23 '17 at 00:57

cmaster - reinstate monica

38,891
9
62
106

According to [C++ reference docs](http://en.cppreference.com/w/cpp/string/basic_string/data) making modifications to the array obtained through the call to data results in undefined behavior, so a copy into a temporary array may be needed. I would also stay awsy from non-reentrant strtok, preferring strtok_r instead. – Sergey Kalinichenko Feb 23 '17 at 01:14
1

This refers only to the const overload of string::data(). If you have a fancy C++17 compiler, there is a non-const overload that returns a pointer to a non-const array which you are allowed to modify. – zett42 Feb 23 '17 at 01:50
@dasblinkenlight Exactly. I'm using the non-const overload, as I'm passing the resulting `char*` to `std::strtok()`, which *requires* a non-const pointer as its first argument. So, no UB. Obviously, I agree with you on prefering `strtok_r()`, but a) it does not seem to be available via the `std::` namespace (which would not stop me from actually importing and using the C function, of course), and b) it's not necessary if you don't have multiple threads anyway (yes, there is real software that's either designed to use multiprocessing over multithreading, or where multithreading is pointless). – cmaster - reinstate monica Feb 23 '17 at 09:43

score -1 · Answer 4 · edited May 23 '17 at 12:25

-1

What about spliting the string? You can check this post for more information.

Inside this post there is a detailed answer about how to split a string in tokens. In this answer maybe you could check the second way using iterators and the copy algorithm.

edited May 23 '17 at 12:25

Community

1
1

answered Feb 22 '17 at 17:55

N. Gerontidis

107
4

Most efficient way to iterate over words in a string

4 Answers4