
My C++ code needs to read a 7 MB text file of space-separated float values. Parsing the string values into a float array takes about 6 seconds, which is too slow for my use case.

I've been reading online and people say it is usually the physical I/O that takes the time. To rule that out, I'm reading the whole file into a stringstream in one shot and parsing the floats from that. Still no improvement in speed. Any ideas how to get it to run faster?

Here's my code (replaced the array entries with dummy_f for simplicity):

    #include "stdafx.h"
    #include <iostream>
    #include <fstream>
    #include "time.h"
    #include <sstream>
    using namespace std;

    int main()
    {
      ifstream testfile;
      string filename = "test_file.txt";
      testfile.open(filename.c_str());

      stringstream string_stream;
      string_stream << testfile.rdbuf();

      testfile.close();

      clock_t begin = clock();
      float dummy_f;

      cout<<"started stream at time "<<(double) (clock() - begin) /(double) CLOCKS_PER_SEC<<endl;

      for(int t = 0; t < 6375; t++)
      {

           string_stream >> dummy_f;

           for(int t1 = 0; t1 < 120; t1++)
           {
               string_stream >> dummy_f;
           }
      }

      cout<<"finished stream at time "<<(double) (clock() - begin) /(double) CLOCKS_PER_SEC<<endl;

      string_stream.str("");

      return 0;
    }

Edit:

Here's a link to the test_cases.txt file https://drive.google.com/file/d/0BzHKbgLzf282N0NBamZ1VW5QeFE/view?usp=sharing

Please change the inner loop bound to 128 when running with this file (I made a typo above).

Edit: Found a way to make it work. I declared dummy_f as a string, read each word from the stringstream into it, and then used atof to convert the string to a float. The time taken is now 0.4 seconds, which is good enough for me.

    string dummy_f;
    vector<float> my_vector;
    for(int t = 0; t < 6375; t++)
    {
        string_stream >> dummy_f;
        my_vector.push_back(atof(dummy_f.c_str()));
        for(int t1 = 0; t1 < 128; t1++)
        {
            string_stream >> dummy_f;
            my_vector.push_back(atof(dummy_f.c_str()));
        }
    }
masupial
    Do not measure performance in debug builds. –  Aug 27 '15 at 18:54
  • This is with integers but you should be able to change it to deal with floating point numbers: http://stackoverflow.com/questions/26736742/efficiently-reading-a-very-large-text-file-in-c – NathanOliver Aug 27 '15 at 18:56
  • @Dieter It takes 6 seconds for the release. Debug mode takes around 10 seconds. That's what's so puzzling. – masupial Aug 27 '15 at 18:59
  • Have you timed subsections? Which ones are taking longest? – jaggedSpire Aug 27 '15 at 19:11
  • You should *reserve* space in your `stringstream` or string. You want to minimize the frequency of resizing the data structures. – Thomas Matthews Aug 27 '15 at 19:16
  • It's pretty hard to make suggestions without being able to RUN your code, so could you make available a test-file that reproduces this? – Mats Petersson Aug 27 '15 at 19:17
  • @ThomasMatthews: You mean the bit that is BEFORE the parsing of the string? – Mats Petersson Aug 27 '15 at 19:18
  • According to your loop, you are not using I/O (at least not to a file). Formatting takes time, unless your file is fixed format and you can write a custom input function. – Thomas Matthews Aug 27 '15 at 19:19
  • @MatsPetersson: My bad, I was going by the question text, not the code. – Thomas Matthews Aug 27 '15 at 19:19
  • Read the entire buffer in, then try your timing with `sscanf` to see if you see similar performance. – Brian Vandenberg Aug 27 '15 at 19:20
  • @MaSu: Have you tried *loop unrolling*? – Thomas Matthews Aug 27 '15 at 19:20
  • Have you considered storing the floats in a different format, for example as raw binary: each value takes 4 bytes no matter what, and reading them back is almost instant (in my case, 27 ms). A sketch of this idea appears after these comments. –  Aug 27 '15 at 19:23
  • The posted code, less the `#include `, with a file containing 121 random numbers on each of 6375 lines, takes 220 ms to read. The file is 6947965 bytes long. [Actually, 0.24 s in "actual time", which is what Windows uses]. And adding some code to make sure `dummy_f` in the inner loop is really used doesn't alter that. – Mats Petersson Aug 27 '15 at 19:27
  • @NathanOliver: Different problem. Much, much larger file. – Mats Petersson Aug 27 '15 at 19:31
  • Hi all, thanks for your replies. @NathanOliver: I am only two months into C++, so your link is all new to me. I will go through it and try it. – masupial Aug 27 '15 at 19:39
  • @Mats: I am using Visual Studio on Windows 7. What is your environment? In any case, how can it be so much slower? – masupial Aug 27 '15 at 19:45
  • @MaSu are you timing a Release build or a Debug build? A Debug build will be much slower than a Release build. In a Release build your test code posted above may not give you meaningful timings however since the compiler may recognize that `dummy_f` is never used and optimize out whole chunks of your code. – mattnewport Aug 27 '15 at 19:47
  • @mattnewport: I am sure I am using release. I put cout's in the loop and verified that it's reading the file. (Also as far as I know Visual Studio doesn't do any such optimisation) – masupial Aug 27 '15 at 19:52
  • I am nearly sure that the problem doesn't lie in the code. ~1 MB per second just for parsing text into floating point variables (about 10 digits per number on average) seems unusually slow for any reasonably modern machine (I'd say anything from very roughly the last ten years). You should try your code on other systems and confirm the results. The parsing you are doing is not such a computationally exhausting task. – luk32 Aug 27 '15 at 19:53
  • Also, please note that when you are asking a performance-related question, in order to get meaningful answers you either need to be very precise about the setup and configuration, so people can take a good guess at where the problem is and whether the data makes sense; or, even better, do proper profiling to pinpoint the bottlenecks, so people can help you resolve the actual problem. Otherwise you will get a bunch of speculative ideas, just like you did; 10+ comments and no answer, because it is pretty much impossible to give a good one currently. – luk32 Aug 27 '15 at 20:02
  • @MaSu: I'm using Linux, so obviously a completely different environment. But it seems rather strange that the Windows runtime should be THAT much worse. I'm pretty sure I can write something trivial that parses floating point numbers character by character from the file itself, and it will be MUCH faster than that. Something is a bit strange here, I think. – Mats Petersson Aug 27 '15 at 20:12
  • What happens if you change `dummy_f` into `std::string` instead of `float`? – Mats Petersson Aug 27 '15 at 20:12
  • @Mats Ran in 0.2 seconds!! So it's some kind of issue with the ">>" operator overload? – masupial Aug 27 '15 at 20:20
  • I don't KNOW the answer, but seems like whatever method the C++ runtime is using to parse numbers is so inefficient I'd call it a bug... – Mats Petersson Aug 27 '15 at 20:33
  • I changed dummy_f into std::string and then call atof(dummy_f.c_str()) on it. It works; the timing is 0.4 seconds now, which is good enough for me. I don't know why ">>" isn't parsing fast enough in Visual Studio. Thanks folks! – masupial Aug 27 '15 at 20:49
  • iostreams floating point parsing being ridiculously slow is [a known problem](https://www.reddit.com/r/cpp/comments/3g3aa1/what_do_you_want_to_see_in_vs_2015_update_1/ctukgdl?context=3) with VS's standard library. (STL in that reddit conversation is Stephan T. Lavavej, VS's standard library maintainer.) – T.C. Aug 28 '15 at 01:15
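
A minimal sketch of the raw-binary idea from the comment above, assuming the data can be converted once up front and is read back on a machine with the same float representation (the `test_file.bin` name is made up for illustration):

    #include <fstream>
    #include <iostream>
    #include <vector>
    using namespace std;

    int main()
    {
        // One-time conversion: parse the text file once (the slow path) and
        // dump the floats as raw bytes.
        {
            ifstream txt("test_file.txt");
            ofstream bin("test_file.bin", ios::binary);
            float f;
            while (txt >> f)
                bin.write(reinterpret_cast<const char*>(&f), sizeof f);
        }

        // Fast path: one read pulls every float straight into the vector.
        ifstream bin("test_file.bin", ios::binary | ios::ate);
        size_t bytes = static_cast<size_t>(bin.tellg());
        vector<float> v(bytes / sizeof(float));
        bin.seekg(0);
        bin.read(reinterpret_cast<char*>(v.data()), v.size() * sizeof(float));
        cout << "loaded " << v.size() << " floats" << endl;
        return 0;
    }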

3 Answers

0

Update: Discussion in the comments with @Mats concluded that locking overhead is unlikely to have anything to do with this, so we're back at square one as far as explaining why Visual C++'s library is so slow at parsing floats. Your sample test file looked like it was mostly numbers with magnitude not too far from 1.0, with nothing weird going on. (Intel's FPUs from Sandy Bridge onward don't have a performance penalty for denormals anyway, according to Agner Fog's tables.)

As others have said, it's time to profile your code and find out which function is taking all the CPU time. Also, performance counters could tell you if branch mispredictions or cache misses are causing problems.


Every call to cin >> dummy_f requires locking to make sure another thread doesn't modify the input buffer at the same time. Reading 4 or 8 floats at once with scanf("%f%f%f%f", &dummy_array[0], &dummy_array[1], ...) would be a little bit more efficient, if that's where the bottleneck is. (scanf is not a great API for this either, since it needs the address of every array element as a function argument. Unrolling by using multiple conversions in one scanf is still a small performance win, though.)
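
A rough sketch of that unrolling idea, reading four floats per fscanf call straight from the FILE* (the file name is taken from the question; whether this actually beats the iostream path on MSVC would need measuring):

    #include <cstdio>
    #include <iostream>
    #include <vector>
    using namespace std;

    int main()
    {
        FILE* fp = fopen("test_file.txt", "r");
        if (!fp) return 1;

        vector<float> values;
        float f[4];
        int n;
        // Each fscanf call matches up to four floats, amortizing the per-call
        // (and any per-call locking) overhead over several conversions.
        while ((n = fscanf(fp, "%f%f%f%f", &f[0], &f[1], &f[2], &f[3])) > 0)
        {
            for (int i = 0; i < n; ++i)
                values.push_back(f[i]);
            if (n < 4) break;   // partial match: end of data reached
        }
        fclose(fp);
        cout << "read " << values.size() << " floats" << endl;
        return 0;
    }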

You're trying to work around this with a stringstream, which may or may not be effective. It's a local variable in a function, so if the compiler can see all the functions and inline them, it could skip the locking entirely: there can't be any other threads with access to this variable.

Peter Cordes
  • So locking is not necessary if you run it on Linux, and not necessary for doing string reads on Windows? Or are you just making something up that sounds good to you, without actually reading the comments? – Mats Petersson Aug 27 '15 at 20:32
  • I would say that it's a mightily inefficient lock if it adds 5.8 seconds onto the time of reading as strings, when the lock isn't contended by another thread. – Mats Petersson Aug 27 '15 at 20:39
  • @MatsPetersson: I'm basically making stuff up, based on how the standard library has to work / how it might be implemented. IDK if this is the correct explanation for the discovery that it's slow on Windows, fast on Linux. IIRC, C stdio and C++ iostreams are both required to be thread-safe, so they either have to do locking, or the compiler has to be clever and do whole-program optimization to see there can't be other threads or signal handlers. – Peter Cordes Aug 27 '15 at 20:43
  • By the way, I think this question says that C++11 is NOT thread-safe within streams (although it is about cout, not cin, I can't imagine that cin and cout are that different): http://stackoverflow.com/questions/6374264/is-cout-synchronized-thread-safe – Mats Petersson Aug 27 '15 at 20:51
  • And I added a `std::mutex mtx` outside the outer loop, and a `std::lock_guard<std::mutex> lock(mtx)` in the inner loop. The overall runtime is 1 ms longer than the run I did before adding that extra code. That's locking and unlocking 771375 times in one millisecond. I somehow doubt that Microsoft's locks are 5800 times slower than a Linux std::mutex. – Mats Petersson Aug 27 '15 at 21:09
0

On my Linux machine it only takes <0.3 seconds, so unless the OP mixed up Debug/Release builds, it looks like a problem unique to Windows:

hidden$ cat read-float.cpp 
#include <fstream>
#include <iostream>
#include <vector>
using namespace std;

int main() {
  ifstream fs("/tmp/xx.txt");
  vector<float> v;
  for (int i = 0; i < 6375; i++) {
    for (int j = 0; j < 129; j++) {
      float f;
      fs >> f;
      v.emplace_back(f);
    }
  }
  cout << "Read " << v.size() << " floats" << endl;
}
hidden$ g++ -std=c++11 read-float.cpp -O3
hidden$ time ./a.out 
Read 822375 floats

real    0m0.287s
user    0m0.279s
sys 0m0.008s

hidden$ g++ -v
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.8/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 4.8.4-2ubuntu1~14.04' --with-bugurl=file:///usr/share/doc/gcc-4.8/README.Bugs --enable-languages=c,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.8 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.8 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --disable-libmudflap --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-4.8-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-4.8-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-4.8-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04) 
Kan Li
  • It's very strange. It's a problem with using ">>" to read a float from the stream, and it apparently happens only on Windows. Reading a string from the stream and then using atof solves the issue. – masupial Aug 28 '15 at 06:05
0

An alternative implementation using atof that runs about 3 times faster is pasted below. On my laptop the original stringstream-based version takes 2.3 seconds to complete, whereas this one completes in under 0.8 seconds for the same number of floats.

#include <iostream>
#include <fstream>
#include <string>
#include <cstdlib>
#include <ctime>
using namespace std;

static char filecontents[10*1024*1024];

int testfun2()
{
  ifstream testfile;
  string filename = "test_file.txt";
  testfile.open(filename.c_str());
  int numfloats=0;
  // leave room for the terminating '\0' in case the file fills the buffer
  testfile.read(filecontents, sizeof(filecontents) - 1);
  size_t numBytesRead = testfile.gcount();
  filecontents[numBytesRead]='\0';
  testfile.close();

  clock_t begin = clock();
  float dummy_f;

  cout<<endl<<"started at time "<<(double) (clock() - begin) /(double) CLOCKS_PER_SEC<<endl;

  char* p= filecontents;
  char* pend = p + numBytesRead;
  while(p<pend)
  {
      while(*p && (*p <= ' '))
      {
         ++p; //skip leading white space ,\r, \n
      }
      char* pvar = p;
      while(*p > ' ')
      {
        ++p; //skip over numbers
      }
      if(*p)
      {
         *p = '\0'; // terminate the token; a shorter input makes atof faster
         ++p;
      }
      if(*pvar)
      {
         dummy_f = atof(pvar);
         ++numfloats;
      }
      //cout << endl << dummy_f;
  }

  cout<<endl<< "finished at time "<<(double) (clock() - begin) /(double) CLOCKS_PER_SEC<<endl;

  cout << endl << "numfloats= " << numfloats;
  return numfloats;
 }
rakeshdn
  • I tested this against the second implementation in the question (stream to string to float using atof) and this one appears to be faster. The second implementation, however, takes between 0.9 and 1.2 seconds on my machine, which is strange. – rakeshdn Aug 27 '15 at 22:20
  • I have many files (similar to this one, but with different dimensions) and the code will be reading one of them depending on user inputs, etc., so I can't use a character array with a predefined size. But it's all right, I can live with this small difference in speed. :) – masupial Aug 28 '15 at 06:02