
I'm trying to convert some code from Python to C++ in an effort to gain a little bit of speed and sharpen my rusty C++ skills. Yesterday I was shocked when a naive implementation of reading lines from stdin was much faster in Python than C++ (see this). Today, I finally figured out how to split a string in C++ with merging delimiters (similar semantics to python's split()), and am now experiencing deja vu! My C++ code takes much longer to do the work (though not an order of magnitude more, as was the case for yesterday's lesson).

Python Code:

#!/usr/bin/env python
from __future__ import print_function                                            
import time
import sys

count = 0
start_time = time.time()
dummy = None

for line in sys.stdin:
    dummy = line.split()
    count += 1

delta_sec = int(time.time() - start_time)
print("Python: Saw {0} lines in {1} seconds. ".format(count, delta_sec), end='')
if delta_sec > 0:
    lps = int(count/delta_sec)
    print("  Crunch Speed: {0}".format(lps))
else:
    print('')

C++ Code:

#include <iostream>                                                              
#include <string>
#include <sstream>
#include <time.h>
#include <vector>

using namespace std;

void split1(vector<string> &tokens, const string &str,
        const string &delimiters = " ") {
    // Skip delimiters at beginning
    string::size_type lastPos = str.find_first_not_of(delimiters, 0);

    // Find first non-delimiter
    string::size_type pos = str.find_first_of(delimiters, lastPos);

    while (string::npos != pos || string::npos != lastPos) {
        // Found a token, add it to the vector
        tokens.push_back(str.substr(lastPos, pos - lastPos));
        // Skip delimiters
        lastPos = str.find_first_not_of(delimiters, pos);
        // Find next non-delimiter
        pos = str.find_first_of(delimiters, lastPos);
    }
}

void split2(vector<string> &tokens, const string &str, char delim=' ') {
    stringstream ss(str); //convert string to stream
    string item;
    while(getline(ss, item, delim)) {
        tokens.push_back(item); //add token to vector
    }
}

int main() {
    string input_line;
    vector<string> spline;
    long count = 0;
    int sec, lps;
    time_t start = time(NULL);

    cin.sync_with_stdio(false); //disable synchronous IO

    while(cin) {
        getline(cin, input_line);
        spline.clear(); //empty the vector for the next line to parse

        //I'm trying one of the two implementations, per compilation, obviously:
//        split1(spline, input_line);  
        split2(spline, input_line);

        count++;
    };

    count--; //subtract for final over-read
    sec = (int) time(NULL) - start;
    cerr << "C++   : Saw " << count << " lines in " << sec << " seconds." ;
    if (sec > 0) {
        lps = count / sec;
        cerr << "  Crunch speed: " << lps << endl;
    } else
        cerr << endl;
    return 0;
}

//compiled with: g++ -Wall -O3 -o split1 split_1.cpp

Note that I tried two different split implementations. One (split1) uses string methods to search for tokens and can merge runs of delimiters as well as handle multiple delimiter characters (it comes from here). The second (split2) uses getline to read the string as a stream, doesn't merge delimiters, and only supports a single delimiter character (that one was posted by several StackOverflow users in answers to string-splitting questions).

I ran this multiple times in various orders. My test machine is a Macbook Pro (2011, 8GB, Quad Core), not that it matters much. I'm testing with a 20M line text file with three space-separated columns that each look similar to this: "foo.bar 127.0.0.1 home.foo.bar"

Results:

$ /usr/bin/time cat test_lines_double | ./split.py
       15.61 real         0.01 user         0.38 sys
Python: Saw 20000000 lines in 15 seconds.   Crunch Speed: 1333333
$ /usr/bin/time cat test_lines_double | ./split1
       23.50 real         0.01 user         0.46 sys
C++   : Saw 20000000 lines in 23 seconds.  Crunch speed: 869565
$ /usr/bin/time cat test_lines_double | ./split2
       44.69 real         0.02 user         0.62 sys
C++   : Saw 20000000 lines in 45 seconds.  Crunch speed: 444444

What am I doing wrong? Is there a better way to do string splitting in C++ that does not rely on external libraries (i.e. no boost), supports merging sequences of delimiters (like python's split), is thread safe (so no strtok), and whose performance is at least on par with python?

Edit 1 / Partial Solution?:

I tried making it a fairer comparison by having Python reset the dummy list and append to it each time, as the C++ code does. This still isn't exactly what the C++ code is doing, but it's a bit closer. Basically, the loop is now:

for line in sys.stdin:
    dummy = []
    dummy += line.split()
    count += 1

The performance of python is now about the same as the split1 C++ implementation.

/usr/bin/time cat test_lines_double | ./split5.py
       22.61 real         0.01 user         0.40 sys
Python: Saw 20000000 lines in 22 seconds.   Crunch Speed: 909090

I am still surprised that, even if Python is so optimized for string processing (as Matt Joiner suggested), these C++ implementations are not faster. If anyone has ideas about how to do this in a more optimal way using C++, please share your code. (I think my next step will be trying to implement this in pure C, although I'm not going to trade off programmer productivity to re-implement my overall project in C, so this will just be an experiment for string-splitting speed.)

Thanks to all for your help.

Final Edit/Solution:

Please see Alf's accepted answer. Since Python deals with strings strictly by reference and STL strings are often copied, performance is better with the vanilla Python implementation. For comparison, I compiled and ran my data through Alf's code, and here is the performance on the same machine as all the other runs, essentially identical to the naive Python implementation (though faster than the Python implementation that resets/appends the list, as shown in the above edit):

$ /usr/bin/time cat test_lines_double | ./split6
       15.09 real         0.01 user         0.45 sys
C++   : Saw 20000000 lines in 15 seconds.  Crunch speed: 1333333

My only small remaining gripe is regarding the amount of code necessary to get C++ to perform in this case.

One of the lessons from this issue and yesterday's stdin line-reading issue (linked above) is that one should always benchmark instead of making naive assumptions about languages' relative "default" performance. I appreciate the education.

Thanks again to all for your suggestions!

JJC
  • How did you compile the C++ program? Do you have optimizations turned on? – interjay Feb 21 '12 at 13:39
  • @interjay: It's in the last comment in his source: `g++ -Wall -O3 -o split1 split_1.cpp` @JJC: How does your benchmark fare when you actually use `dummy` and `spline` respectively, maybe Python removes the call to `line.split()` because it has no side-effects? – Eric Feb 21 '12 at 13:40
  • What results do you get if you remove the splitting, and leave only reading lines from stdin? – interjay Feb 21 '12 at 13:51
  • Python is written in C. It means that there is an efficient way of doing it, in C. Maybe there is a better way to split a string than using the STL? – ixe013 Feb 21 '12 at 13:55
  • @interjay Please see my question from yesterday (linked in my question near the top). Just reading lines from stdin was a bit faster in C++ than Python after turning off io synchronization, i.e. cin.sync_with_stdio(false). – JJC Feb 21 '12 at 13:59
  • It might have to do with push_back that resizes the vector all the time ? Try `reserve(200)` and see what that gets you. – ixe013 Feb 21 '12 at 14:04
  • @ixe013: That's exactly what I think as well. I had given a solution to either use `reserve()` or use `std::list` to push-back sentences and assign the list to the vector at the end. – Vite Falcon Feb 21 '12 at 14:08
  • possible duplicate of [Why do std::string operations perform poorly?](http://stackoverflow.com/questions/8310039/why-do-stdstring-operations-perform-poorly) – Matt Joiner Feb 21 '12 at 14:14
  • @JJC: I have edited my answer to include what I believe is to be added to your Python code to make it a 'fair' comparison against C++. – Vite Falcon Feb 21 '12 at 14:19
  • @MattJoiner I didn't see that question before, although I think the discussion generated here including your comments and those of Alf are worth preserving in their own right. The string splitting code is also likely useful to python programmers learning C++ and which might search for these exact phrases (i.e. python, c++, splitting). – JJC Feb 21 '12 at 14:58
  • @JJC: Can you try one more edit mentioned in my post? – Vite Falcon Feb 21 '12 at 15:11
  • On my system, `split2` is a bit faster than `split.py`, and a hand-written split that matches Python's functionality exactly is more than twice as fast as `split.py`. – n. m. could be an AI Feb 21 '12 at 15:36
  • @n.m. Hmmm, I wonder if this is data-dependent. I ran my benchmarks on two different machines and got consistent results. Could you please share your hand-written split code? – JJC Feb 21 '12 at 15:45
  • @ViteFalcon Tried it, posted results at your answer. – JJC Feb 21 '12 at 15:45
  • @JJC: Thanks for trying it out. Clears some of my doubts as well. A very good question and good to know you found an answer. Up-voted and starred :) – Vite Falcon Feb 21 '12 at 15:47
  • @ViteFalcon Thanks and thanks for your suggestions! This has been eye-opening. – JJC Feb 21 '12 at 15:51
  • My code is similar to that in the accepted answer, except I'm just using normal `std::string`s everywhere and not StringRef. – n. m. could be an AI Feb 21 '12 at 15:54
  • @n.m. Can you share it regardless? If you didn't need to write your own StringRef class then your code will be more parsimonious and thus a better solution for most people. Please share, I promise to upvote it. ;-) – JJC Feb 21 '12 at 16:02
  • @JJC: the following article gives a pretty good implementation of string splitting in c++: http://www.codeproject.com/Articles/23198/C-String-Toolkit-StrTk-Tokenizer – Zamfir Kerlukson Sep 24 '12 at 01:24
  • Part of the problem is that you're not "using" the data. If you see the results when counting the number of words and characters ( https://github.com/tobbez/string-splitting/pull/2 ), then almost all of the C/C++ versions beat the Python versions. – Dave Johansen Apr 15 '15 at 05:24

8 Answers


As a guess, Python strings are reference counted immutable strings, so that no strings are copied around in the Python code, while C++ std::string is a mutable value type, and is copied at the smallest opportunity.

If the goal is fast splitting, then one would use constant time substring operations, which means only referring to parts of the original string, as in Python (and Java, and C#…).

The C++ std::string class has one redeeming feature, though: it is standard, so that it can be used to pass strings safely and portably around where efficiency is not a main consideration. But enough chat. Code -- and on my machine this is of course faster than Python, since Python's string handling is implemented in C which is a subset of C++ (he he):

#include <iostream>                                                              
#include <string>
#include <sstream>
#include <time.h>
#include <vector>

using namespace std;

class StringRef
{
private:
    char const*     begin_;
    int             size_;

public:
    int size() const { return size_; }
    char const* begin() const { return begin_; }
    char const* end() const { return begin_ + size_; }

    StringRef( char const* const begin, int const size )
        : begin_( begin )
        , size_( size )
    {}
};

vector<StringRef> split3( string const& str, char delimiter = ' ' )
{
    vector<StringRef>   result;

    enum State { inSpace, inToken };

    State state = inSpace;
    char const*     pTokenBegin = 0;    // Init to satisfy compiler.
    for( auto it = str.begin(); it != str.end(); ++it )
    {
        State const newState = (*it == delimiter? inSpace : inToken);
        if( newState != state )
        {
            switch( newState )
            {
            case inSpace:
                result.push_back( StringRef( pTokenBegin, &*it - pTokenBegin ) );
                break;
            case inToken:
                pTokenBegin = &*it;
            }
        }
        state = newState;
    }
    if( state == inToken )
    {
        result.push_back( StringRef( pTokenBegin, &*str.end() - pTokenBegin ) );
    }
    return result;
}

int main() {
    string input_line;
    vector<string> spline;
    long count = 0;
    int sec, lps;
    time_t start = time(NULL);

    cin.sync_with_stdio(false); //disable synchronous IO

    while(cin) {
        getline(cin, input_line);
        //spline.clear(); //empty the vector for the next line to parse

        //I'm trying one of the two implementations, per compilation, obviously:
//        split1(spline, input_line);  
        //split2(spline, input_line);

        vector<StringRef> const v = split3( input_line );
        count++;
    };

    count--; //subtract for final over-read
    sec = (int) time(NULL) - start;
    cerr << "C++   : Saw " << count << " lines in " << sec << " seconds." ;
    if (sec > 0) {
        lps = count / sec;
        cerr << "  Crunch speed: " << lps << endl;
    } else
        cerr << endl;
    return 0;
}

//compiled with: g++ -Wall -O3 -o split1 split_1.cpp -std=c++0x

Disclaimer: I hope there aren't any bugs. I haven't tested the functionality, but only checked the speed. But I think, even if there is a bug or two, correcting that won't significantly affect the speed.

Cheers and hth. - Alf
  • Thanks! This makes sense. I'm getting an error about the iterator type, so I can't get this to compile yet (still googling), but will update my question with an apples-to-apples speed result once I figure it out (feel free to post an update if it's a problem others are likely to have. I copied your code verbatim and am using g++ version 4.2.1 on OS-X 10.6). – JJC Feb 21 '12 at 14:52
  • Yes, Python strings are reference counted objects, so Python does far less copying. They still contain null-terminated C strings under the hood, though, not (pointer, size) pairs like your code. – Fred Foo Feb 21 '12 at 14:54
  • @JCC: did you remember to add `-std=c++0x` option? if version 4.2.1 doesn't support `auto` then just use `std::string::const_iterator`. – Cheers and hth. - Alf Feb 21 '12 at 15:02
  • @Cheersandhth.-Alf I had left out the hyphen and the compiler was barfing before getting to tell me it didn't recognize the argument. ;-) Still, this led me to learn that const_iterator != iterator. So, it was a worthy pursuit. Thanks again for your great answer! – JJC Feb 21 '12 at 15:53
  • Very nice solution. Personally I probably wouldn’t have bothered and just used pairs of iterators. Not that I advocate this. ;-) – Konrad Rudolph Feb 21 '12 at 16:48
  • In other words - for higher level work, like text manipulation, stick to a higher level language, where effort to do it efficiently has been put cumulatively by tens of developers over tens of years - or just prepare to work as much as all those developers for having something comparable in lower level. – jsbueno Feb 21 '12 at 18:04
  • @jsbueno After thinking about how I'd use Alf's StringRef objects in my code, I came to a similar conclusion just now. :-) I still like his solution when raw split speed is important and when most values will not be later used. Otherwise, I'm too much of a noob to see how to easily make these StringRef objects easily usable in my code. I suppose a class method to return an actual string object would be a good start. – JJC Feb 22 '12 at 13:38
  • Relatedly, Alf, if you have more example code of how you'd use such StringRefs in a larger program that needs to store, compare, map these string objects, that would be really helpful and educational to see. A pastebin or blog post would be great if you have time. Thanks again. – JJC Feb 22 '12 at 13:40
  • @JJC: The poor man's solution is just `shared_ptr`. but keep in mind that that is not a solution to this SO question, but instead a solution to your more general requirements, in particular ability to store such references. With C++03 this primitive solution could speed up e.g. sorting tremendously, but there is a threshold: at some point the string data must be copied out of e.g. Google's API, or any API that does not support this. For a *sketch* of a more proper solution, take a look at my old ["StringValue"](http://sourceforge.net/projects/alfsstringvalue/) project at SourceForge. – Cheers and hth. - Alf Feb 22 '12 at 14:47
  • @JJC: for the `StringRef`, you can copy the substring to a `std::string` very easily, just `string( sr.begin(), sr.end() )`. – Cheers and hth. - Alf Feb 22 '12 at 14:59
  • @Cheersandhth.-Alf Thanks Alf, I'll look into these tips. – JJC Feb 22 '12 at 22:44
  • I wish CPython strings were copied less. Yes, they are reference counted and immutable but [str.split() allocates new strings for each item](http://hg.python.org/cpython/file/2.7/Objects/stringlib/split.h#l35) using `PyString_FromStringAndSize()` that calls `PyObject_MALLOC()`. So there is no optimization with a shared representation that exploits that the strings are immutable in Python. – jfs Aug 10 '12 at 16:14
  • Maintainers: please do not introduce bugs by attempting to fix *perceived* bugs (especially not with reference to http://www.cplusplus.com). TIA. – Cheers and hth. - Alf Feb 12 '15 at 05:26

I'm not providing any better solutions (at least performance-wise), but some additional data that could be interesting.

Using strtok_r (reentrant variant of strtok):

void splitc1(vector<string> &tokens, const string &str,
        const string &delimiters = " ") {
    char *saveptr;
    char *cpy, *token;

    cpy = (char*)malloc(str.size() + 1);
    strcpy(cpy, str.c_str());

    for(token = strtok_r(cpy, delimiters.c_str(), &saveptr);
        token != NULL;
        token = strtok_r(NULL, delimiters.c_str(), &saveptr)) {
        tokens.push_back(string(token));
    }

    free(cpy);
}

Additionally using character strings for parameters, and fgets for input:

void splitc2(vector<string> &tokens, const char *str,
        const char *delimiters) {
    char *saveptr;
    char *cpy, *token;

    cpy = (char*)malloc(strlen(str) + 1);
    strcpy(cpy, str);

    for(token = strtok_r(cpy, delimiters, &saveptr);
        token != NULL;
        token = strtok_r(NULL, delimiters, &saveptr)) {
        tokens.push_back(string(token));
    }

    free(cpy);
}

And, in some cases, where destroying the input string is acceptable:

void splitc3(vector<string> &tokens, char *str,
        const char *delimiters) {
    char *saveptr;
    char *token;

    for(token = strtok_r(str, delimiters, &saveptr);
        token != NULL;
        token = strtok_r(NULL, delimiters, &saveptr)) {
        tokens.push_back(string(token));
    }
}

The timings for these are as follows (including my results for the other variants from the question and the accepted answer):

split1.cpp:  C++   : Saw 20000000 lines in 31 seconds.  Crunch speed: 645161
split2.cpp:  C++   : Saw 20000000 lines in 45 seconds.  Crunch speed: 444444
split.py:    Python: Saw 20000000 lines in 33 seconds.  Crunch Speed: 606060
split5.py:   Python: Saw 20000000 lines in 35 seconds.  Crunch Speed: 571428
split6.cpp:  C++   : Saw 20000000 lines in 18 seconds.  Crunch speed: 1111111

splitc1.cpp: C++   : Saw 20000000 lines in 27 seconds.  Crunch speed: 740740
splitc2.cpp: C++   : Saw 20000000 lines in 22 seconds.  Crunch speed: 909090
splitc3.cpp: C++   : Saw 20000000 lines in 20 seconds.  Crunch speed: 1000000

As we can see, the solution from the accepted answer is still fastest.

For anyone who would want to do further tests, I also put up a Github repo with all the programs from the question, the accepted answer, this answer, and additionally a Makefile and a script to generate test data: https://github.com/tobbez/string-splitting.

tobbez
  • I did a pull request ( https://github.com/tobbez/string-splitting/pull/2 ) that makes the test a little more realistic by "using" the data (counting the number of words and characters). With this change, all of the C/C++ versions beat the Python versions (except for the one based on Boost's tokenizer that I added) and the real value of "string view" based methods (like that of split6) shines. – Dave Johansen Apr 15 '15 at 05:22
  • You should use `memcpy`, not `strcpy`, in case the compiler fails to notice that optimization. `strcpy` typically uses a slower startup strategy that strikes a balance between fast for short strings vs. ramp up to full SIMD for long strings. `memcpy` knows the size right away, and doesn't have to use any SIMD tricks to check for the end of an implicit-length string. (Not a big deal on modern x86). Creating `std::string` objects with the `(char*, len)` constructor might be faster, too, if you can get that out of `saveptr-token`. Obviously it would be fastest to just store `char*` tokens :P – Peter Cordes Apr 29 '18 at 05:01

I think the following code is better, using some C++17 and C++14 features:

// This code is untested as I write this post, but I'll test it
// when I'm free, and I sincerely welcome others to test and modify
// this code.

// C++17
#include <istream>     // For std::istream.
#include <string_view> // new feature in C++17, sizeof(std::string_view) == 16 in libc++ on my x86-64 debian 9.4 computer.
#include <string>
#include <utility>     // C++14 feature std::move.

template <template <class...> class Container, class Allocator>
void split1(Container<std::string_view, Allocator> &tokens, 
            std::string_view str,
            std::string_view delimiter = " ") 
{
    /* 
     * The model of the input string:
     *
     * (optional) delimiter | content | delimiter | content | delimiter| 
     * ... | delimiter | content 
     *
     * Using std::string::find_first_not_of or 
     * std::string_view::find_first_not_of is a bad idea, because it 
     * actually does the following thing:
     * 
     *     Finds the first character not equal to any of the characters 
     *     in the given character sequence.
     * 
     * Which means it does not treat your delimiters as a whole, but as
     * a group of characters.
     * 
     * This has 2 effects:
     *
     *  1. When your delimiter is more than a single character, this
     *  function won't behave as you expect.
     *
     *  2. When your delimiter is just a single character, the function
     *  may have additional overhead, because it has to check every
     *  character against a range of characters; although there's only
     *  one, in order to assure correctness it still has an inner loop,
     *  which adds to the overhead.
     *
     * So, as a solution, I wrote the following code.
     *
     * The code below will skip a leading delimiter prefix.
     * However, if there's nothing between two delimiters, this code
     * will still treat it as if there were something there.
     *
     * Note: 
     * Here I use the C++ std version of the substring-search algorithm,
     * but you can change it to Boyer-Moore, KMP (takes additional
     * memory), Rabin-Karp or another algorithm to speed up your code.
     * 
     */

    // Establish the loop invariant 1.
    typename std::string_view::size_type 
        next, 
        delimiter_size = delimiter.size(),  
        pos = str.find(delimiter) ? 0 : delimiter_size;

    // The loop invariant:
    //  1. At pos, it is the content that should be saved.
    //  2. The next pos of delimiter is stored in next, which could be 0
    //  or std::string_view::npos.

    do {
        // Find the next delimiter, maintain loop invariant 2.
        next = str.find(delimiter, pos);

        // Found a token; add it to the vector.
        // Note: substr takes (pos, count), not (pos, end), so the
        // length is next - pos; when next == npos this yields the
        // rest of the string.
        tokens.push_back(str.substr(pos, next - pos));

        // Skip the delimiter, maintaining loop invariant 1.
        //
        // When next == std::string_view::npos, the loop terminates, so
        // it doesn't matter that the following expression wraps around.
        pos = next + delimiter_size;
    } while(next != std::string_view::npos);
}   

template <template <class...> class Container, class traits, class Allocator2, class Allocator>
void split2(Container<std::basic_string<char, traits, Allocator2>, Allocator> &tokens, 
            std::istream &stream,
            char delimiter = ' ')
{
    std::basic_string<char, traits, Allocator2> item;

    // Unfortunately, std::getline can only accept a single-character 
    // delimiter.
    while(std::getline(stream, item, delimiter))
        // Move item into tokens. std::getline assigns to item on the
        // next iteration, which is valid for a moved-from string.
        tokens.push_back(std::move(item));
}

The choice of container:

  1. std::vector.

    Assuming the initial size of the allocated internal array is 1 and the ultimate size is N, you will allocate and deallocate log2(N) times, and you will copy elements (2 ^ (log2(N) + 1) - 1) = (2N - 1) times. As pointed out in Is the poor performance of std::vector due to not calling realloc a logarithmic number of times?, this can perform poorly when the size of the vector is unpredictable and could be very large. But if you can estimate the size, this will be less of a problem.

  2. std::list.

    For every push_back, the time consumed is constant, but it will probably take more time than an individual std::vector push_back. Using a per-thread memory pool and a custom allocator can ease this problem.

  3. std::forward_list.

    Same as std::list, but occupies less memory per element. Requires a wrapper class to work, due to the lack of a push_back API.

  4. std::array.

    If you know the limit of growth, then you can use std::array. Of course, you can't use it directly, since it doesn't have a push_back API. But you can define a wrapper, and I think it's the fastest option here and can save some memory if your estimate is quite accurate.

  5. std::deque.

    This option allows you to trade memory for performance. There will be no (2N - 1) copies of elements, just N allocations and no deallocation. Also, you'll have constant random-access time, and the ability to add new elements at both ends.

According to std::deque-cppreference

On the other hand, deques typically have large minimal memory cost; a deque holding just one element has to allocate its full internal array (e.g. 8 times the object size on 64-bit libstdc++; 16 times the object size or 4096 bytes, whichever is larger, on 64-bit libc++)

or you can use combo of these:

  1. std::vector< std::array<T, 2 ^ M> >

    This is similar to std::deque; the difference is just that this container doesn't support adding elements at the front. But it is still faster, because it won't copy the underlying std::array (2N - 1) times; it'll just copy the pointer array (2N / 2^M - 1) times, allocating a new array only when the current one is full, and it doesn't need to deallocate anything. By the way, you get constant random-access time.

  2. std::list< std::array<T, ...> >

    Greatly eases the pressure of memory fragmentation. It will only allocate a new array when the current one is full, and does not need to copy anything. You will still have to pay the price of an additional pointer compared to combo 1.

  3. std::forward_list< std::array<T, ...> >

    Same as 2, but costs the same memory as combo 1.

JiaHao Xu
  • If you use std::vector with some reasonable initial size, like 128 or 256, the total copies (assuming a growth factor of 2), you avoid any copying at all for sizes up to that limit. You can then shrink the allocation to fit the number of elements you actually used so it's not terrible for small inputs. This doesn't help much with the total number of copies for the very large `N` case, though. It's too bad [std::vector can't use `realloc` to potentially allow mapping more pages at the end of the current allocation](//stackoverflow.com/q/49615076), so it's about 2x slower. – Peter Cordes Apr 29 '18 at 04:37
  • Is `stringview::remove_prefix` as cheap as just keeping track of your current position in a normal string? [`std::basic_string::find`](http://en.cppreference.com/w/cpp/string/basic_string/find) has an optional 2nd arg `pos = 0` to let you start searching from an offset. – Peter Cordes Apr 29 '18 at 04:43
  • @ Peter Cordes That is correct. I checked [libcxx impl](https://github.com/llvm-mirror/libcxx/blob/master/include/string_view) – JiaHao Xu Apr 29 '18 at 04:52
  • I also checked [libstdc++ impl](https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/include/experimental/string_view), which is the same. – JiaHao Xu Apr 29 '18 at 04:55
  • Your analysis of vector's performance is off. Consider a vector that has an initial capacity of 1 when you first insert and it doubles every time it needs new capacity. If you need to put in 17 items, the first allocation makes room for 1, then 2, then 4, then 8, then 16, then finally 32. This means there were 6 allocations total (`log2(size - 1) + 2`, using integer log). The first allocation moved 0 strings, the second moved 1, then 2, then 4, then 8, then finally 16, for a total of 31 moves (`2^(log2(size - 1) + 1) - 1)`). This is O(n), not O(2^n). This will greatly outperform `std::list`. – David Stone Apr 29 '18 at 06:07
  • In the original answer, I assumed the ultimate size of the container is (2 ^ N); that is why the figure was so large. Thank you for pointing this out; I should use log2 instead. – JiaHao Xu Apr 29 '18 at 07:30
  • With only an average of `2N - 1` copies for `N` push_back operations, that's very likely cheaper than `std::list`. A `list` node needs a pointer as well as the data, so that's extra work right there. For `int` or `long` elements, it's as much work as storing a data element. Allocating memory for `N` new nodes is also expensive: unless you have per-thread memory pools, `new` probably needs an atomic operation to be thread-safe. And then if you want to ever look at your data, `std::vector` is far cheaper than walking a linked list. – Peter Cordes May 05 '18 at 12:54

I suspect that this is because of the way std::vector gets resized during repeated push_back() calls. If you use std::list or std::vector::reserve() to reserve enough space for the sentences, you should get much better performance. Or you could use a combination of both, like below for split1():

void split1(vector<string> &tokens, const string &str,
        const string &delimiters = " ") {
    // Skip delimiters at beginning
    string::size_type lastPos = str.find_first_not_of(delimiters, 0);

    // Find first non-delimiter
    string::size_type pos = str.find_first_of(delimiters, lastPos);
    list<string> token_list;

    while (string::npos != pos || string::npos != lastPos) {
        // Found a token, add it to the list
        token_list.push_back(str.substr(lastPos, pos - lastPos));
        // Skip delimiters
        lastPos = str.find_first_not_of(delimiters, pos);
        // Find next non-delimiter
        pos = str.find_first_of(delimiters, lastPos);
    }
    tokens.assign(token_list.begin(), token_list.end());
}

EDIT: The other obvious thing I see is that the Python variable dummy gets assigned each time but never modified. So it's not a fair comparison against C++. You should try modifying your Python code to initialize it with dummy = [] and then do dummy += line.split(). Can you report the runtime after this?

EDIT2: To make it even fairer, can you modify the while loop in the C++ code to be:

    while(cin) {
        getline(cin, input_line);
        std::vector<string> spline; // create a new vector

        //I'm trying one of the two implementations, per compilation, obviously:
//        split1(spline, input_line);  
        split2(spline, input_line);

        count++;
    }
Vite Falcon
  • 6,575
  • 3
  • 30
  • 48
  • Thanks for the idea. I implemented it and this implementation is actually slower than the original split1, unfortunately. I also tried spline.reserve(16) before the loop, but this had no impact on my split1's speed. There are only three tokens per line, and the vector is cleared after each line, so I didn't expect that to help much. – JJC Feb 21 '12 at 14:11
  • I tried your edit as well. Please see the updated question. Performance is now on-par with split1. – JJC Feb 21 '12 at 14:54
  • I tried your EDIT2. The performance was a bit worse: `$ /usr/bin/time cat test_lines_double | ./split7` gave `33.39 real 0.01 user 0.49 sys` and `C++: Saw 20000000 lines in 33 seconds. Crunch speed: 606060`. – JJC Feb 21 '12 at 15:30
2

You're making the mistaken assumption that your chosen C++ implementation is necessarily faster than Python's. String handling in Python is highly optimized. See this question for more: Why do std::string operations perform poorly?
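As an illustration of the kind of hand optimization mentioned above (this is not from the answer itself), one common way to cut allocation cost on the C++ side is to record token positions instead of copying every token into a fresh string:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>
using namespace std;

// Sketch: store (offset, length) pairs instead of calling substr() per
// token, so tokenizing allocates nothing per token. Names here are
// illustrative, not from the question.
void split_offsets(vector<pair<size_t, size_t>> &tokens,
                   const string &str, char delim = ' ') {
    size_t i = 0, n = str.size();
    while (i < n) {
        while (i < n && str[i] == delim) ++i;   // skip run of delimiters
        size_t start = i;
        while (i < n && str[i] != delim) ++i;   // scan one token
        if (i > start) tokens.push_back(make_pair(start, i - start));
    }
}
```

Callers that only inspect tokens can compare against the original string in place, and only materialize a std::string for the tokens they actually keep.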

Matt Joiner
  • 112,946
  • 110
  • 377
  • 526
  • 4
    I'm not making any claims about overall language performance, only about my particular code. So, no assumptions here. Thanks for the good pointer to the other question. I'm not sure if you're saying that this particular implementation in C++ is suboptimal (your first sentence) or that C++ is just slower than Python in string processing (your second sentence). Also, if you know of a fast way to do what I'm trying to do in C++ please share it for everyone's benefit. Thanks. Just to clarify, I love python, but I'm no blind fanboy, which is why I'm trying to learn the fastest way to do this. – JJC Feb 21 '12 at 14:25
  • 1
    @JJC: Given that Python's implementation is faster, I'd say yours is suboptimal. Keep in mind that language implementations can cut corners for you, but ultimately algorithmic complexity and hand optimizations win out. In this case, Python has the upper hand for this use case by default. – Matt Joiner Feb 22 '12 at 02:49
2

If you take the split1 implementation and change the signature to match that of split2 more closely, by changing this:

void split1(vector<string> &tokens, const string &str, const string &delimiters = " ")

to this:

void split1(vector<string> &tokens, const string &str, const char delimiters = ' ')

You get a more dramatic difference between split1 and split2, and a fairer comparison:

split1  C++   : Saw 10000000 lines in 41 seconds.  Crunch speed: 243902
split2  C++   : Saw 10000000 lines in 144 seconds.  Crunch speed: 69444
split1' C++   : Saw 10000000 lines in 33 seconds.  Crunch speed: 303030
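The modified body (the split1' variant above) would look roughly like this, since find_first_of and find_first_not_of also have single-char overloads (a sketch of the described change, not the answerer's exact code):

```cpp
#include <cassert>
#include <string>
#include <vector>
using namespace std;

// Sketch: split1 with a char delimiter instead of a string of delimiters,
// matching split2's signature. The char overloads of find_first_of /
// find_first_not_of avoid scanning a delimiter set per character.
void split1_char(vector<string> &tokens, const string &str,
                 char delimiter = ' ') {
    string::size_type lastPos = str.find_first_not_of(delimiter, 0);
    string::size_type pos = str.find_first_of(delimiter, lastPos);
    while (string::npos != pos || string::npos != lastPos) {
        tokens.push_back(str.substr(lastPos, pos - lastPos));
        lastPos = str.find_first_not_of(delimiter, pos);
        pos = str.find_first_of(delimiter, lastPos);
    }
}
```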
Paul Beckingham
  • 14,495
  • 5
  • 33
  • 67
1
void split5(vector<string> &tokens, const string &str, char delim=' ') {

    enum { do_token, do_delim } state = do_delim;
    int idx = 0, tok_start = 0;
    for (string::const_iterator it = str.begin() ; ; ++it, ++idx) {
        switch (state) {
            case do_token:
                if (it == str.end()) {
                    tokens.push_back (str.substr(tok_start, idx-tok_start));
                    return;
                }
                else if (*it == delim) {
                    state = do_delim;
                    tokens.push_back (str.substr(tok_start, idx-tok_start));
                }
                break;

            case do_delim:
                if (it == str.end()) {
                    return;
                }
                if (*it != delim) {
                    state = do_token;
                    tok_start = idx;
                }
                break;
        }
    }
}
n. m. could be an AI
  • 112,515
  • 14
  • 128
  • 243
  • Thanks n.m.! Unfortunately, this seems to run at about the same speed as the original (split1) implementation on my dataset and machine: `$ /usr/bin/time cat test_lines_double | ./split8` gave `21.89 real 0.01 user 0.47 sys` and `C++: Saw 20000000 lines in 22 seconds. Crunch speed: 909090`. – JJC Feb 22 '12 at 09:14
  • On my machine: split1 — 54s, split.py — 35s, split5 — 16s. I have no idea. – n. m. could be an AI Feb 22 '12 at 10:28
  • Hmm, does your data match the format I noted above? I assume you ran each several times to eliminate transient effects like initial disk cache population? – JJC Feb 22 '12 at 11:26
0

I suspect that this is related to buffering on sys.stdin in Python, versus no buffering in the C++ implementation.

See this post for details on how to change the buffer size, then try the comparison again: Setting smaller buffer size for sys.stdin?
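On the C++ side, one way to experiment with stdin buffering (a sketch, not from either post) is to hand the stdio stream a large buffer before any reads; whether this measurably helps is implementation-specific, so it should be timed rather than assumed:

```cpp
#include <cstdio>
#include <iostream>
using namespace std;

// Sketch: give a stdio stream a large, fully buffered buffer. setvbuf
// must be called before any other operation on the stream; the 1 MiB
// size suggested below is a guess, not a measured value.
bool set_big_buffer(FILE *stream, char *buf, size_t size) {
    return setvbuf(stream, buf, _IOFBF, size) == 0;
}

// Typical use at the top of main(), before any reads:
//   static char buf[1 << 20];
//   set_big_buffer(stdin, buf, sizeof buf);
//   ios::sync_with_stdio(false);  // decouple cin from stdio as well
```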

  • 1
    Hmmm... I don't follow. Just reading lines (without the split) is faster in C++ than Python (after including the cin.sync_with_stdio(false); line). That was the issue I had yesterday, referenced above. – JJC Feb 21 '12 at 14:01