
I am parsing a ~500GB log file and my C++ version takes 3.5 minutes and my Go version takes 1.2 minutes.

I am using C++ iostreams to read the file in line by line for parsing.

#include <fstream>
#include <string>
#include <iostream>

int main( int argc , char** argv ) {
   int linecount = 0 ;
   std::string line ;
   std::ifstream infile( argv[ 1 ] ) ;
   if ( infile ) {
      while ( getline( infile , line ) ) {
          linecount++ ;
      }
      std::cout << linecount << ": " << line << '\n' ;
   }
   infile.close( ) ;
   return 0 ;
}

Firstly, why is it so slow to use this code? Secondly, how can I improve it to make it faster?

jimjampez
  • *why is it so slow to use this code* first measure again without using the std::cout part - you're now measuring file I/O and printing to console – stijn Dec 28 '15 at 12:57
  • `cout` is out of the loop. Should not make a difference. – Dialecticus Dec 28 '15 at 13:00
  • When you say that you are "parsing" a file, what do you mean by that? Is all you're doing reading and counting lines? – Some programmer dude Dec 28 '15 at 13:00
  • This kind of question benefits from knowing the exact C++ implementation. Unlike Go, C++ has multiple independent implementations. – MSalters Dec 28 '15 at 13:03
  • Have you tried std::ios_base::sync_with_stdio(false); ? And std::cin.rdbuf()->pubsetbuf(buffer, sizeof(buffer)); with a decent sized char buffer[]; (a sketch of this suggestion follows the comments) – Tony Delroy Dec 28 '15 at 13:05
  • Did you compile with optimizations turned on? By default they are not unless you use a release build. – NathanOliver Dec 28 '15 at 13:12
  • @Dialecticus ah got me there indeed, read it too quick – stijn Dec 28 '15 at 13:38
  • Reading 500GB in 80 seconds is 50Gbit/s. For sure beats any hard disk I ever tried. – Support Ukraine Dec 28 '15 at 13:52
  • @StillLearning 500 GB in 72 sec is under 7 GB/sec. Still pretty impressive given SATA-3 is 6 Gbit/sec - about an order of magnitude lower than the performance claimed. – Andrew Henle Dec 28 '15 at 13:58
  • @AndrewHenle - ehh...? 7 GB/sec is about the same as the 50 Gbit/s I wrote. Not sure what you mean? Is your comment because I rounded the 72 sec to 80 sec? I did that so I didn't need a calculator.... – Support Ukraine Dec 28 '15 at 14:03
  • One more [related question](http://stackoverflow.com/questions/8809607/fast-file-reading). – stgatilov Dec 28 '15 at 18:50
  • @jimjampez: I think it would be great to know which hard drive was used. Right now it is hard to believe that you really read your file from a hard drive (even SSD is slower). – stgatilov Dec 28 '15 at 18:59
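
A minimal sketch of the buffering suggestion from Tony Delroy's comment, adapted here to the ifstream from the question rather than std::cin. Note that pubsetbuf's effect is implementation-defined; with libstdc++ it only takes effect if called before open(), and sync_with_stdio(false) only matters when reading through std::cin:

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main( int, char** argv ) {
   std::ios_base::sync_with_stdio( false );   // only affects the standard streams; shown for completeness

   std::vector<char> buf( 1 << 20 );          // 1 MiB stream buffer
   std::ifstream infile;
   // implementation-defined: with libstdc++ this must be called before open()
   infile.rdbuf()->pubsetbuf( buf.data(), buf.size() );
   infile.open( argv[ 1 ] );

   long long linecount = 0;
   std::string line;
   while ( std::getline( infile, line ) ) {
       linecount++;
   }
   std::cout << linecount << '\n';
   return 0;
}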

3 Answers


The C++ standard library's iostreams are notoriously slow, and this is the case for all of the different implementations of the standard library. Why? Because the standard imposes many requirements on the implementation which inhibit best performance. This part of the standard library was designed roughly 20 years ago and is not really competitive in high-performance scenarios.

How can you avoid it? Use other libraries for high-performance asynchronous I/O, such as Boost.Asio, or the native functions provided by your OS.

If you want to stay within the standard, the function std::basic_istream::read() may satisfy your performance demands, but then you have to do the buffering and line counting yourself. Here's how it can be done.

#include <algorithm>
#include <fstream>
#include <iostream>
#include <vector>

int main( int, char** argv ) {
   long long linecount = 0 ;   // count '\n' characters, like the getline loop does
   std::vector<char> buffer;
   buffer.resize(1000000); // buffer of 1MB size
   std::ifstream infile( argv[ 1 ] ) ;
   while (infile)
   {
       infile.read( buffer.data(), buffer.size() );
       linecount += std::count( buffer.begin(), 
                                buffer.begin() + infile.gcount(), '\n' );
   }
   std::cout << "linecount: " << linecount << '\n' ;
   return 0 ;
}

Let me know if it's faster!

Ralph Tandetzky

Building on @Ralph Tandetzky's answer, but going down to the low-level C I/O functions, and assuming a Linux platform with a filesystem that provides good direct I/O support (but staying single-threaded):

#ifndef _GNU_SOURCE
#define _GNU_SOURCE            // O_DIRECT is a GNU/Linux extension
#endif
#include <fcntl.h>             // ::open, O_RDONLY, O_DIRECT
#include <unistd.h>            // ::read, ::close
#include <cstdlib>             // ::valloc
#include <iostream>

#define BUFSIZE ( 1024UL * 1024UL )
int main( int argc, char **argv )
{
    // use direct IO - the page cache only slows this down
    int fd = ::open( argv[ 1 ], O_RDONLY | O_DIRECT );

    // Direct IO needs page-aligned memory
    char *buffer = ( char * ) ::valloc( BUFSIZE );

    size_t newlines = 0UL;

    // avoid any conditional checks in the loop - have to
    // check the return value from read() anyway, so use that
    // to break the loop explicitly
    for ( ;; )
    {
        ssize_t bytes_read = ::read( fd, buffer, BUFSIZE );
        if ( bytes_read <= ( ssize_t ) 0L )
        {
            break;
        }

        // I'm guessing here that computing a boolean-style
        // result and adding it without an if statement
        // is faster - might be wrong.  Try benchmarking
        // both ways to be sure.
        for ( ssize_t ii = 0; ii < bytes_read; ii++ )
        {
            newlines += ( buffer[ ii ] == '\n' );
        }
    }

    ::close( fd );

    std::cout << "newlines:  " << newlines << endl;

    return( 0 );
}

If you really need to go even faster, use multiple threads to read and count newlines so you're reading data while you're counting newlines. But if you're not running on really fast hardware designed for high performance, this is overkill.
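
A minimal sketch of that idea, assuming plain buffered reads (no O_DIRECT) and standard C++11 facilities: one task counts the previously read chunk on another thread while the main thread reads the next chunk into a second buffer, then the buffers are swapped.

#include <cstddef>
#include <cstdio>
#include <future>
#include <iostream>
#include <vector>

int main( int, char **argv )
{
    const std::size_t kBufSize = 1024UL * 1024UL;      // 1 MiB per chunk
    std::FILE *fp = std::fopen( argv[ 1 ], "rb" );
    if ( !fp ) return 1;

    std::vector<char> bufA( kBufSize ), bufB( kBufSize );
    std::vector<char> *readBuf = &bufA, *countBuf = &bufB;

    std::size_t newlines = 0;
    std::size_t pending  = 0;   // bytes in *countBuf still waiting to be counted

    for ( ;; )
    {
        // count the previous chunk on another thread while reading the next one
        std::future<std::size_t> counted = std::async( std::launch::async,
            [countBuf, pending]
            {
                std::size_t n = 0;
                for ( std::size_t i = 0; i < pending; ++i )
                    n += ( ( *countBuf )[ i ] == '\n' );
                return n;
            } );

        std::size_t got = std::fread( readBuf->data(), 1, kBufSize, fp );
        newlines += counted.get();

        if ( got == 0 )
            break;

        pending = got;
        std::swap( readBuf, countBuf );   // hand the fresh chunk to the counter
    }

    std::fclose( fp );
    std::cout << "newlines: " << newlines << '\n';
    return 0;
}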

Andrew Henle

The I/O routines from good old C should be significantly faster than the clumsy C++ streams. If you know a reasonable upper bound on the lengths of all lines, then you can use fgets coupled with a buffer like char line[1<<20];. Since you are going to actually parse your data, you might want to simply use fscanf directly on your file.
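
For illustration, a minimal fgets-based version of the line-counting loop might look like the sketch below, assuming no line exceeds the fixed buffer (a longer line would be split across calls and counted more than once):

#include <cstdio>

int main( int, char **argv )
{
    static char line[ 1 << 20 ];   // generous upper bound on line length (1 MiB)
    std::FILE *fp = std::fopen( argv[ 1 ], "r" );
    if ( !fp )
        return 1;

    long long linecount = 0;
    while ( std::fgets( line, sizeof line, fp ) )   // one call per line, if it fits
        ++linecount;

    std::fclose( fp );
    std::printf( "%lld lines\n", linecount );
    return 0;
}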

Note that if your file is physically stored on a hard drive, then the hard drive's read speed would become the bottleneck anyway, as noted here. That's why you do not really need the fastest CPU-side parsing in order to minimize processing time; perhaps a simple fscanf would suffice.

stgatilov