
To my surprise, I've noticed that the following C++ program, which simply reads a large file line by line, stores the lines in a vector and outputs the vector size, runs way slower than its Python counterpart. How can I optimize it? Thanks

#include <fstream>
#include <iostream>
#include <vector>
#include <string>

int main() {
    std::ifstream infile("longfile.txt");
    std::string line;
    std::vector<std::string> lines;

    while (std::getline(infile, line)) {
        lines.push_back(line);
    }

    std::cout << lines.size() << std::endl;

    return 0;
}

Run:

$ time ./a.out
1390000

real    0m6.388s
user    0m6.130s
sys 0m0.243s

Python:

with open('longfile.txt') as f:
    lines = f.readlines()

print len(lines)

Run:

$ time python test.py
1390000

real    0m1.003s
user    0m0.158s
sys     0m0.146s
pistacchio
  • No, I don't know it in advance. – pistacchio Feb 25 '16 at 16:36
  • What the Python `readlines` function does is basically what your C++ program does. If you don't know the number of lines to begin with, and if they are not all the exact same size, there's really no way to optimize it. – Some programmer dude Feb 25 '16 at 16:37
  • Would it help to use a `list` (i.e. a linked list) rather than a `vector` (i.e. a wrapped array)? – khelwood Feb 25 '16 at 16:44
  • It will be faster, but it won't be the same as the Python solution that leaves me with a list to use. I'm just trying to understand how come the "slow" language comes 6 times faster than the "fast" one. – pistacchio Feb 25 '16 at 16:45
  • After seeing what you're doing, there *are* ways to optimize it. One way, if you don't really care about the content, is the one suggested by @FredrikRosenqvist (now removed comment, but he said that there's no need for the vector, which will save quite a lot of time). Another way, if you don't need the actual data, is to read a chunk of bytes and look manually for the newline character. – Some programmer dude Feb 25 '16 at 16:45
  • 1
    Some questions: (1) how did you compile the C++ program, did you use optimization flags like `-O2`? If not the comparison is moot. (2) re your timing values, can you repeat and confirm them? Python could just be faster because by the second run data was cached in memory. We need more info. – mindriot Feb 25 '16 at 16:49
  • You nailed it. I didn't turn optimization on. With the -O3 flag I get an average of 1.7 seconds. Still slower than Python (why?) but not 6 times (a typical optimized build is shown after these comments). – pistacchio Feb 25 '16 at 16:51
  • 1
    Run the two tests repeatedly and in alternating order, to exclude any odd effects due to caching at the disk and/or the OS level. – mindriot Feb 25 '16 at 16:53
  • 1
  • Try `std::deque`. It does not require that the internal storage be organized in a contiguous memory region, as vector does. – kwarnke Feb 25 '16 at 17:01
  • @pistacchio You're using `std::vector` where it's inappropriate. C++ gives you the choice of many different containers so that you can choose the right one for your use case -- list, deque, vector, array, and so on. As it happens `vector` is not a good generic "I don't want to think about which container to choose" option, but `list` is. If you must use `std::vector` for some reason, call `reserve` with a fairly large number before you call `push_back`. – David Schwartz Feb 25 '16 at 22:34
  • Have you checked this? - http://stackoverflow.com/questions/9371238/why-is-reading-lines-from-stdin-much-slower-in-c-than-python – mementum Feb 26 '16 at 14:01
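
For reference, the optimized build those comments converge on is just a matter of passing an optimization flag to the compiler. Assuming GCC and a source file named main.cpp (both are assumptions), the build and timing steps look like:

$ g++ -O2 main.cpp -o a.out
$ time ./a.out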

2 Answers


If you know in advance the number of lines you have to read, you can reserve that many elements in your std::vector object. Let's say you have 10 lines to read:

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::ifstream infile("longfile.txt");
    std::string line;
    std::vector<std::string> lines;

    lines.reserve(10);

    while (std::getline(infile, line)) {
        lines.push_back(line);
    }

    std::cout << lines.size() << std::endl;

    return 0;
}

This way the memory is allocated before you start pushing. Be careful: reserve is not resize. lines.reserve(10); allocates capacity for 10 elements in advance, but it constructs nothing, so lines.empty() will still be true.
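
You can see the difference with a quick check (a minimal sketch; the exact capacity values are implementation-defined, at least as large as requested):

#include <iostream>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> lines;

    lines.reserve(10);
    // Capacity grew to at least 10, but no strings exist yet.
    std::cout << lines.size() << " " << lines.capacity() << std::endl; // e.g. 0 10

    lines.resize(10);
    // Now 10 empty strings have actually been constructed.
    std::cout << lines.size() << " " << lines.capacity() << std::endl; // e.g. 10 10

    return 0;
}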

If you can't know the number of lines in advance, std::list (a doubly linked list) or std::forward_list (a singly linked list) will be helpful. Each new element is appended to the list until you have finished reading the file. A list never needs to reallocate memory, which std::vector must do every time it reaches its maximum capacity. Reallocating and copying elements in memory is very expensive in terms of time, so with a list you at least reduce the time spent while parsing the file.

After the file is parsed, copying the list into a std::vector is a good idea, because by then you know exactly how much memory you need, and a contiguous memory allocation is faster to access. A sketch of that two-step approach follows.
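
Here is a minimal sketch of it, under the same assumptions as the question (a file named longfile.txt). Moving each line into the list, and then moving the strings into the vector with std::make_move_iterator, avoids copying the character data; a plain copy would also work:

#include <fstream>
#include <iostream>
#include <iterator>
#include <list>
#include <string>
#include <vector>

int main() {
    std::ifstream infile("longfile.txt");
    std::string line;
    std::list<std::string> tmp;

    // A list never reallocates: each line gets its own node.
    while (std::getline(infile, line)) {
        tmp.push_back(std::move(line));
    }

    // One contiguous allocation of exactly the right size, then one pass
    // that moves the strings instead of copying them.
    std::vector<std::string> lines(std::make_move_iterator(tmp.begin()),
                                   std::make_move_iterator(tmp.end()));

    std::cout << lines.size() << std::endl;
    return 0;
}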

Whichever you choose, I highly recommend changing:

while (std::getline(infile, line)) {
    lines.push_back(line);
}

to:

while (std::getline(infile, line)) {
    lines.push_back(std::move(line));
}

By default, for most of the containers in the STL, a call to the copy constructor or the assignment operator makes a full duplication of the data. std::move prevents such a copy by letting the destination take over the source's internal buffer.

You can easily check this with the following example:

std::string a("Hello");
std::string b(a);

std::cout<<a.size()<<" "<<b.size()<<std::endl;
std::cout<<"the address of a is : "<<a.c_str()<<" "<<b.c_str()<<std::endl;

std::string d(std::move(a));

std::cout<<a.size()<<" "<<d.size()<<std::endl;

std::string e;
std::string f;

e = b;

std::cout<<e.size()<<" "<<b.size()<<std::endl;
std::cout<<"the address of a is : "<<e.c_str()<<" "<<b.c_str()<<std::endl;


f = std::move(b);

std::cout<<f.size()<<" "<<b.size()<<std::endl;
John_Sharp1318
  • `std::vector` is still generally going to be faster than `std::list` and variants, even if you don't know the size up front. The O() performance for inserting N elements is the same: O(N) in both cases, but vector's method of periodically allocating large chunks of memory is much faster than list's per-node allocation. – BeeOnRope Feb 26 '16 at 00:05

The answer to almost any optimization problem is "first, profile". Profile your C++ application and determine where the time is being spent.

Still, I can make some educated guesses about what is slow here, and point out how this would show up in the profiler.

Slow getline()

getline() may be implemented in a slow manner. For example, it may need to ask the runtime for one character at a time, since it needs to stop reading once it sees the newline character. That is, it cannot ask for bytes in bigger chunks, since it has no guaranteed way to "put back" the rest of the chunk when a newline appears in the middle of the chunk.

The runtime is almost certainly going to buffer the underlying file read, so this won't be anywhere near as bad as one system call per character, but the overhead of effectively calling getc for every character in the file can still be significant.

This would show up in the profiler as a lot of time spent in getline() - and in particular in some getc()-like method that getline calls.

The Python implementation doesn't have this issue at all, since a single readlines() call is made and the implementation knows the entire file will be read and can buffer at will.

Redundant Copying

The other likely candidate is redundant copying.

First, the runtime makes read() calls and copies chunks of the file into an internal buffer. Then the getline() implementation is likely going to build up each line in an internal char[] buffer before passing it to a string constructor, which likely makes another copy (unless the runtime uses internal tricks to hand the buffer off directly).

Then, as Johnny_S points out, there may be more copies when you push these strings into the vector.

This would show up in the profiler as time spent spread around in the various copies mentioned above, e.g., in the string() constructor.

The Python implementation can also avoid most of these redundant copies since it has a higher-level view of the problem; rather than taking the layered approach of your C++ implementation, it likely makes only 1 or 2 copies.

Solutions

The solution mentioned here fixes both of the problems above. To re-implement the Python readlines call, you should go a bit lower level: read the file in chunks of char[], and look for newline characters directly in the buffer. Technically, you don't need to create string objects at all, since you only output the number of lines found, but if you do want to create those objects, make sure you copy the char[] data only once into each string.

You can do this using the string(const char* s, size_t n) constructor, pointing directly into your character buffer. Finally, ensure you don't make another copy when you insert into your vector, as Johnny_S suggests. A sketch of this approach follows.
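
Here is a minimal sketch of that chunked approach, under the same assumptions as the question (a file named longfile.txt; the 64 KiB chunk size is my own choice). It uses string::append(const char*, size_t) rather than the constructor so that a line can span two chunks:

#include <cstdio>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::FILE* f = std::fopen("longfile.txt", "rb");
    if (!f) {
        return 1;
    }

    std::vector<std::string> lines;
    std::string partial;   // carries a line split across chunk boundaries
    char buf[1 << 16];     // read in 64 KiB chunks

    std::size_t n;
    while ((n = std::fread(buf, 1, sizeof buf, f)) > 0) {
        std::size_t start = 0;
        for (std::size_t i = 0; i < n; ++i) {
            if (buf[i] == '\n') {
                // Copy the char[] data only once into each string.
                partial.append(buf + start, i - start);
                lines.push_back(std::move(partial));
                partial.clear();
                start = i + 1;
            }
        }
        partial.append(buf + start, n - start);
    }
    if (!partial.empty()) {
        lines.push_back(std::move(partial)); // file with no trailing newline
    }

    std::fclose(f);
    std::cout << lines.size() << std::endl;
    return 0;
}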

BeeOnRope