string str1, str2;
vector<string> vec;

ifstream infile;

infile.open("myfile.txt");
while (! infile.eof() )
{
    getline(infile,str1);
    istringstream is;
        is >> str1;
    while (is >> str2)
    { 
        vec.push_back(str2);
    }
} 

What the code does is read the strings from the file and store them in a vector.

Performance needs to be given priority. How can I optimize this code to make the reading faster?

lily
  • First things first: [Why is iostream::eof inside a loop condition considered wrong?](http://stackoverflow.com/q/5605125/1870232) – P0W Jan 25 '14 at 10:13
    What format? It seems that you are trying to parse the file. I don't really see much room for optimization here. Have you actually **measured** that this is a bottleneck? –  Jan 25 '14 at 10:14

4 Answers


As others have already pointed out (see for example herohuyongtao's answer), the loop condition and how you put str1 into the istringstream must be fixed.

However, there is an important issue here that everybody has missed so far: you don't need the istringstream at all!

  vec.reserve(the_number_of_words_you_expect_at_least);

  while (infile >> str1) {
    vec.push_back(str1);
  }

This gets rid of the inner loop, which you didn't need in the first place, and doesn't create an istringstream in each iteration.

If you need to parse each line further and you do need an istringstream, create it outside of the loop and set its string buffer via istringstream::str(const string& s).
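
For illustration, here is a minimal sketch of that pattern (the file name is just a placeholder); clear() resets the stream's eof/fail flags before str() installs the next line:

    #include <fstream>
    #include <sstream>
    #include <string>
    #include <vector>

    int main() {
        std::ifstream infile("myfile.txt");  // placeholder file name
        std::istringstream is;               // constructed once, outside the loop
        std::vector<std::string> vec;
        std::string line, word;

        while (std::getline(infile, line)) {
            is.clear();    // reset the eof/fail bits left over from the previous line
            is.str(line);  // install the new line as the stream's buffer
            while (is >> word)
                vec.push_back(word);
        }
    }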

I can easily imagine your loops being very slow: Heap allocation on Windows is outrageously slow (compared to Linux); I got bitten once.

Andrei Alexandrescu presents (in some sense) a similar example in his talk Writing Quick Code in C++, Quickly. The surprising thing is that doing unnecessary heap allocations in a tight loop like the one above can be slower than the actual file IO. I was surprised to see that.


You didn't tag your question as C++11 but here is what I would do in C++11.

  while (infile >> str1) {
    vec.emplace_back(std::move(str1));
  }

This move-constructs the string at the back of the vector, without copying it. We can do this because we don't need the contents of str1 after we have put it into the vector. In other words, there is no need to copy it into a brand new string at the back of the vector; it is sufficient to just move its contents there. The first loop, with vec.push_back(str1);, may potentially copy the contents of str1, which is really unnecessary.

The string implementation in gcc 4.7.2 is currently copy on write, so the two loops have identical performance; it doesn't matter which one you use. For now.

Unfortunately, copy on write strings are now forbidden by the standard. I don't know when the gcc developers are going to change the implementation. If the implementation changes, it may make a difference in performance whether you move (emplace_back(std::move(s))) or you copy (push_back(s)).

If C++98 compatibility is important for you, then go with push_back(). Even if the worst thing happens in the future and your string is copied (it isn't copied now), that copy can be turned into a memmove() / memcpy() which is blazing fast, most likely faster than reading the contents of the file from the hard disk so file IO will most likely remain the bottleneck.
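
Putting the pieces together, a minimal self-contained C++11 sketch of the whole approach might look like this (the reserve figure is only a placeholder for your own lower estimate):

    #include <fstream>
    #include <string>
    #include <utility>
    #include <vector>

    int main() {
        std::ifstream infile("myfile.txt");
        std::vector<std::string> vec;
        vec.reserve(100000);  // placeholder: a rough lower estimate of the word count

        std::string word;
        while (infile >> word)
            vec.emplace_back(std::move(word));  // move, don't copy; operator>> clears word before reuse
    }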

Ali
  • +1 - can well be http://en.wikipedia.org/wiki/Golden_mean_%28philosophy%29 for many – bobah Jan 25 '14 at 11:04
  • my file contains a big bunch of data, I have no idea how much to reserve; how do I go about this? – lily Jan 25 '14 at 11:38
  • @aayat Then give a reasonable estimate for the amount you expect *at least*. Nothing disastrous happens if this estimate is too low, the vector will resize automatically; I wouldn't worry about this so much. Which compiler are you using, gcc, Visual Studio or something else? – Ali Jan 25 '14 at 11:43
  • @aayat OK. Please read the updated answer. It concerns the gcc implementation, directly relevant to you. – Ali Jan 25 '14 at 12:33
  • @Ali As you mentioned, I can create the istringstream outside of the loop and set it using istringstream::str(str1); but when it loops again, the string buffer doesn't change – lily Jan 26 '14 at 02:24
  • @aayat Declare `istringstream is;` outside of the loop. Then call `is.clear();` right before `is.str(str1);` in the loop. However, unless you do some very complicated parsing, there is no need for the `istringstream` object. In the example you are showing us in your question, the `istringstream` object is unnecessary. – Ali Jan 26 '14 at 11:39
  • @aayat OK. Do you have any further question concerning the code you show in your *original* question? – Ali Jan 26 '14 at 12:34
  • I came to the conclusion that I would have to do it like this: `istringstream is; while ( getline(infile,str1) ) { is.clear(); is.str(str1); while (is >> str2) { vec.push_back(str2); } if (vec.size() == 7) /* each line contains 7 fields */ { /* do smtg */ } }` Because I need to do something, and then clear the vector, after reading 7 words. – lily Jan 26 '14 at 12:41
  • @Ali Thanks for the help, sorry for being a bug! – lily Jan 26 '14 at 12:53
  • @aayat No problem; I am glad my answer helped a little bit. – Ali Jan 26 '14 at 13:24

Before any optimization, you need to change

while (! infile.eof() )      // problem 1
{
    getline(infile,str1);
    istringstream is;
        is >> str1;          // problem 2
    while (is >> str2){ 
        vec.push_back(str2);
        }
 }

to

while ( getline(infile,str1) )   // 1. don't use eof() in a while-condition
{
    istringstream is(str1);      // 2. put str1 into the istringstream
    while (is >> str2) {
        vec.push_back(str2);
    }
}

to make it work as you expected.


P.S. For the optimization part, you don't need to think too much about it unless it becomes a bottleneck. Premature optimization is the root of all evil. However, if you do want to speed it up, check out @Ali's answer for further info.

herohuyongtao
  • @herohuyongtao I am **not** the downvoter but I *guess* the downvote was because of missing an important detail. Please read my answer. – Ali Jan 25 '14 at 11:00
  • @Ali Yes, you're probably right. However, I still doubt how much this will help speed it up. – herohuyongtao Jan 25 '14 at 11:02
  • 1
    @herohuyongtao Andrei Alexandrescu presents a similar example in his talk [Writing Quick Code in C++, Quickly](http://channel9.msdn.com/Events/GoingNative/2013/Writing-Quick-Code-in-Cpp-Quickly). The surprising thing is that doing unnecessary heap allocations in a tight loop can become slower than the actual file IO. I was surprised too. – Ali Jan 25 '14 at 11:06
  • @herohuyongtao +1 for fixing the bugs in the OP and for properly referring to the other optimization opportunities instead of stealing them (not everybody does that, unfortunately). – Ali Jan 25 '14 at 12:43

The loop condition is wrong, but that is not a performance issue. What follows assumes this IO loop is indeed your application's bottleneck; even if it is not, optimizing it can be a good educational exercise or just weekend fun.

You have quite a few temporaries and cases of dynamic memory allocation in the loop.

Calling std::vector::reserve() in front of the loop will improve it a bit. Reallocating manually to emulate a 1.2x growth factor instead of 2x past a certain size will help as well (see the sketch below). std::list may be more appropriate, though, if the file size is unpredictable.
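
As a rough illustration of the growth-factor idea (the 1.2 factor and the threshold are arbitrary placeholders, not measured values):

    #include <cstddef>
    #include <string>
    #include <vector>

    // Reserve ahead of push_back to emulate a ~1.2x growth factor once the
    // vector is already large; below the threshold, the implementation's
    // default growth (typically 1.5x-2x) is left alone.
    void push_with_slow_growth(std::vector<std::string>& vec, std::string word) {
        const std::size_t threshold = 1u << 20;        // arbitrary cut-over point
        if (vec.size() >= threshold && vec.size() == vec.capacity())
            vec.reserve(vec.size() + vec.size() / 5);  // grow by ~20% instead of 2x
        vec.push_back(std::move(word));
    }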

Using std::istringstream as a tokenizer is very suboptimal. Switching to an iterator-based "view" tokenizer (Boost has one) should improve the speed a lot.
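
To sketch the idea without pulling in Boost: a hand-rolled "view" tokenizer can walk a pair of pointers over the line and construct exactly one std::string per token, with no istringstream involved (whitespace as the separator is an assumption here):

    #include <cctype>
    #include <string>
    #include <vector>

    // Split `line` on whitespace by sliding two pointers over it;
    // each token becomes exactly one std::string in `out`.
    void tokenize(const std::string& line, std::vector<std::string>& out) {
        const char* p   = line.data();
        const char* end = p + line.size();
        while (p != end) {
            while (p != end && std::isspace(static_cast<unsigned char>(*p))) ++p;   // skip separators
            const char* tok = p;
            while (p != end && !std::isspace(static_cast<unsigned char>(*p))) ++p;  // scan the token
            if (tok != p)
                out.emplace_back(tok, p);  // one allocation per token, nothing else
        }
    }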

If you need it to be very fast and have enough RAM, you can memory map the file before reading it. Boost.Iostreams can get you there quickly. Generally, though, you can be about twice as fast without Boost (Boost is not bad, but it has to be generic and work on a dozen compilers, which is why).
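
For illustration, a sketch with Boost.Iostreams' mapped_file_source (error handling omitted; the tokenizing itself is the same pointer-walking idea as sketched above):

    #include <boost/iostreams/device/mapped_file.hpp>

    int main() {
        // Maps the whole file read-only; data()/size() expose it as one contiguous buffer.
        boost::iostreams::mapped_file_source file("myfile.txt");
        const char* begin = file.data();
        const char* end   = begin + file.size();
        // ... walk [begin, end) with a pointer-based tokenizer like the one sketched above ...
    }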

If you are a blessed person using Unix/Linux as your development environment, run your program under valgrind --tool=cachegrind and you will see all the problematic places and how bad they are relative to one another. Also, valgrind --tool=massif will let you identify numerous small heap-allocated objects, which is generally not tolerable in high-performance code.

bobah
  • Of course as Ali pointed out, the tokenizer is not needed at all, because the lines don't have any meaning for the result. But in general the link to boost is useful and the links to valgrind are even more useful. – Jan Hudec Jan 25 '14 at 10:56
  • @JanHudec - agree, I upvoted Ali's answer, but I don't think the skill of tokenizing strings efficiently that I shared in my answer is a waste of space on this page – bobah Jan 25 '14 at 11:06

The fastest, though not fully portable, approach is to load the file into a memory mapped region (see the Wikipedia article on mmap).

Given that you know the size of the file, you can now define forward iterators (possibly pointers to const char) over that memory region, which you can use to find the tokens that separate your file into "strings".

Essentially, you repeatedly get a pair of pointers pointing to the first character and to the end of each "string", respectively. From that pair of iterators, create your std::string.
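
A rough POSIX-only sketch of that idea, splitting on '\n' and assuming the file's bytes can go into std::string as-is (error handling kept to a minimum):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #include <string>
    #include <vector>

    int main() {
        int fd = open("myfile.txt", O_RDONLY);
        if (fd < 0) return 1;

        struct stat sb;
        if (fstat(fd, &sb) < 0 || sb.st_size == 0) return 1;

        void* mapping = mmap(nullptr, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (mapping == MAP_FAILED) return 1;
        const char* data = static_cast<const char*>(mapping);
        const char* end  = data + sb.st_size;

        std::vector<std::string> lines;
        for (const char* p = data; p != end; ) {
            const char* nl = p;
            while (nl != end && *nl != '\n') ++nl;  // find the end of the current "string"
            lines.emplace_back(p, nl);              // construct std::string from the pointer pair
            p = (nl == end) ? end : nl + 1;         // skip the separator
        }

        munmap(mapping, sb.st_size);
        close(fd);
    }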

This approach has subtle issues though:

  • You need to take care of the character encoding of the file, possibly converting from that encoding to the one your std::string is expected to hold (presumably UTF-8).

  • The "token" that separates strings (usually \n) may be platform dependent, or may depend on which program created the file.

CouchDeveloper
  • http://www.boost.org/doc/libs/1_55_0/libs/iostreams/doc/index.html does all of this in a portable way (with a tolerable performance penalty) – bobah Jan 27 '14 at 14:06