I'm trying to read a 250K-line file and apply a regex to each line. However, the code is much slower than the equivalent Java code using readLine. In Java all the parsing is done in ~10 seconds, while in C++ it takes more than 2 minutes. I've seen the related question "C++ ifstream.getline() significantly slower than Java's BufferedReader.readLine()?" and added these two lines at the top of main:

std::ifstream::sync_with_stdio(false);
std::ios::sync_with_stdio(false);

The rest of the code (I simplified it to remove any delays regex might be causing):

#include "stdafx.h"
#include <ios>
#include <string>
#include <fstream>
#include <iostream>


int _tmain(int argc, _TCHAR* argv[])
{

    std::string libraryFile = "H:\\library.txt";
    std::ios::sync_with_stdio(false);
    std::string line;

    int i = 1;

    std::ifstream file(libraryFile);
    while (std::getline (file, line)) {
        std::cout << "\rStored " << i++ << " lines.";
    }

    return 0;
}

The example seems quite simple, but even the fix suggested in most posts doesn't seem to work. I've run the .exe multiple times using release settings in VS2012, but I just can't reach Java's times.

Mike Drakoulelis
  • possible duplicate of http://stackoverflow.com/questions/6820765/c-ifstream-getline-significantly-slower-than-javas-bufferedreader-readline – Shreyos Adikari Jun 18 '13 at 16:00
  • Are you sure it isn't the `std::cout` that is making it slow? – Travis Pessetto Jun 18 '13 at 16:02
  • I know it is a duplicate, and I even said so, but that solution doesn't seem to work on my snippet, which is why I reposted. Could I expand that post instead? I'm not sure what the appropriate action is. – Mike Drakoulelis Jun 18 '13 at 16:03
  • Without seeing the respective Java code, it's quite hard to say anything. – Zavior Jun 18 '13 at 16:04
  • Note the `Buffered` in your link. A `BufferedReader` buffers its contents. Does C++'s `.getline()` do the same? – fge Jun 18 '13 at 16:05
  • Even 10s for the Java program is far beyond what's expected. My mid-class notebook takes 0.9ms (yes, milliseconds) using `BufferedReader.readLine` to read a 256kB text file, each line being between 20 and 40 characters long. – jarnbjo Jun 18 '13 at 16:09
  • @TravisPessetto you are right! cout was causing all the delay. I will use printf instead, since it doesn't cause any delay. – Mike Drakoulelis Jun 18 '13 at 16:12
  • @jarnbjo the file I am parsing is actually 40MB in size, as I said it contains 250,000 lines - I hope I wasn't confusing with the 250k term. – Mike Drakoulelis Jun 18 '13 at 16:14
  • Mike, be careful: in C++, outputting to the screen causes your program to give up its turn on the processor. So I am not sure `printf` will do any better, plus there are security concerns with `printf`. – Travis Pessetto Jun 18 '13 at 16:14
  • @TravisPessetto I am not sure what the security concerns are but I doubt it will cause problems on a simple printf as this (all the variables are checked long before that for inappropriate values). printf works without adding any delay to the process though! – Mike Drakoulelis Jun 18 '13 at 16:22
  • But isn't what @fge said more important? Forget the delays from printf or cout, both of which write to stdout. The delay is happening because of the buffering in BufferedReader, which I believe reads blocks of data and then parses them line by line. Correct me if I am wrong; I am unaware of how std::getline() is implemented. It has a buffer too, but is it sized to match the filesystem's block size? – DarthCoder Jun 18 '13 at 16:26
  • The only real security concerns in `printf` come when you take the format string itself as a user-supplied value (a no-no!) or mismatch the types of the arguments with the expectation in the pattern. (On Unix, the Real Man's way to do output is directly with the `write` syscall.) – Donal Fellows Jun 18 '13 at 16:30
  • @MikeDrakoulelis the biggest concern is maintenance: someone can change something and not realize that it may affect a `printf` somewhere. Screen output is usually slower because it usually means immediate interaction, which moves the program out of the running set and into the wait set. Most disk operations can be buffered, since their results usually are not needed immediately. That is why I thought it might be `std::cout`. – Travis Pessetto Jun 18 '13 at 16:32

1 Answer

The slowness is caused by a couple of things.

  • Mixing cout and cin: The C++ IO library has to synchronize (flush) cout every time cin is used. This is to ensure things like input prompts are displayed before asking for input. This really hurts buffering.

  • Using the Windows console output: The Windows console is so slow, especially while doing terminal emulation, that it isn't funny. If at all possible, output to a file instead.

Zan Lynx