
I have a text file with about 4 million floats (roughly 30 MB), and I want to read them into a vector<float>.

The code I have is very bare-bones, and it gets the job done:

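std::vector<float> number_vec;  // destination vector
std::fstream is("data.txt", std::ios_base::in);

float number;
while (is >> number)  // formatted extraction, one float per iteration
{
   //printf("%f ", number);
   number_vec.push_back(number);
}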

The problem is that it takes 20-30 s on a modern desktop workstation. At first I assumed I had done something stupid, but the more I stared at the code, the more I started to accept that maybe this was just the time it takes to parse all those ASCII values into floats.

However, then I remembered that MATLAB can read and parse the same file almost instantly (disk speed seems to be the limit), so my code is obviously just very inefficient.

The only thing I could think of was to reserve the required elements in the vector in advance, but it didn't improve the situation at all.

Can someone help me understand why, and maybe help me write a faster solution?

EDIT: The text file looks like this:

152.00256 45.8569 5.87214 0.225 -0.0005 .....

i.e. one row, space-delimited.

Markus
  • Please post a sample of the text file. – zett42 May 10 '17 at 20:26
  • You need to use a profiler to pinpoint bottlenecks. Beyond that, reserving space in the vector ahead of time should provide a significant boost. – Brad Allred May 10 '17 at 20:29
  • Possible duplicate of [Fast textfile reading in c++](http://stackoverflow.com/questions/17925051/fast-textfile-reading-in-c) – gsamaras May 10 '17 at 20:32
  • Possible duplicate of [How to parse space-separated floats in C++ quickly?](http://stackoverflow.com/questions/17465061/how-to-parse-space-separated-floats-in-c-quickly) – Brad Allred May 10 '17 at 20:35
  • For an idea on performance see http://stackoverflow.com/questions/3664272/is-stdvector-so-much-slower-than-plain-arrays – jsn May 10 '17 at 20:35
  • @BradAllred I'm trying to keep up with all the other comments. But your link does look like the same problem I have. I'll give it a shot. – Markus May 10 '17 at 20:43
  • Tried your code on my system, takes around 1.3s with 4M floating point numbers (72MiB). Are you sure you are measuring an optimized build? – Baum mit Augen May 10 '17 at 20:45
  • @BaummitAugen I'm sure. But if you look at Brad Allred's link you can see that fstream is extremely slow on Windows compared to, say, `fscanf`. Are you running Linux? – Markus May 10 '17 at 20:50
  • @Markus Yes, so maybe that's it. (Even on Linux, `fscanf` turns out to be ca. 30% faster, but don't sue me if I did not measure this correctly.) – Baum mit Augen May 10 '17 at 20:53
  • That's like 3000 cycles/byte, or 20000 cycles/float, which is ridiculous. Try reading the whole thing into a string first; if it's significantly faster then your code is likely not buffering the reads. – Veedrac May 11 '17 at 01:16

1 Answer


Please consider taking a look at the possible duplicates shared by @gsamaras and @Brad Allred. Anyway, I will try to give a simple answer that aims to keep the code simple and friendly, and that assumes the following two premises:

  • You are constrained by the file and will change neither the file format nor the way floats are represented textually in it.
  • You want to keep using STL and are not looking for a library specialized/optimized for the challenge you are facing.

With those constraints in mind, my main suggestion would be to preallocate your containers, both the float vector and the internal iostream buffer:

  • Increase the performance of insertion into number_vec by reserving the required size in the std::vector. This can be achieved by a call to reserve, as explained in this Stack Overflow post.
  • Increase the performance of the iostream by setting the size of the buffer it uses internally. This can be achieved by a call to pubsetbuf, as explained in this other Stack Overflow post.
diogoslima
  • Copying the file to a `std::string` is _not_ likely to improve performance. Using a memory mapped file would be much better. – Brad Allred May 10 '17 at 21:12
  • Agreed @BradAllred, I will edit and remove the secondary part of my post as it is misleading. – diogoslima May 10 '17 at 21:15