This is the easiest way I know to do what you want using existing stdio library functions. (I'm using stdio since that's what you're already using and seem to be familiar with. There are other ways to do this with C++ streams.)
stdio files opened using fopen() already buffer input, and your OS likely uses a page cache. Adding another layer of buffering in your application means there would be three layers of buffering between the data on disk and your application processing the data: 1) the page cache, 2) the stdio buffer, 3) your chunk. As @ShadowRanger commented, just use the stdio buffer; then you can use the POSIX getline() function to read lines.
// change size to suit your requirements
#define BUFSIZE ( 16UL * 1024UL * 1024UL )

FILE *fp = fopen( path, "rb" );

// assuming a POSIX OS - could also use malloc()/free()
char *buffer = ( char * ) mmap( NULL, BUFSIZE, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0 );

// setvbuf() must be called before any other operation on the stream
setvbuf( fp, buffer, _IOFBF, BUFSIZE );

char *line = NULL;
size_t len = 0;

for ( ;; )
{
    ssize_t currentLen = getline( &line, &len, fp );
    if ( currentLen < 0 )
    {
        break;
    }

    // process line
}

free( line );
fclose( fp );
munmap( buffer, BUFSIZE );
You'll need to add error checking along with the proper header files (on a POSIX system: <stdio.h>, <stdlib.h> and <sys/mman.h>).
That should do exactly what you want, and you don't have to write code that has to figure out where lines end, nor do you have to deal with lines that span multiple fread() calls.
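For reference, here is one way the snippet above might look with the headers and error checks filled in. It is only a sketch under some assumptions: a POSIX system with getline() and anonymous mmap(), and a hypothetical helper name, count_lines(), that simply counts lines as a stand-in for real per-line processing.

```c
#define _DEFAULT_SOURCE   /* glibc: expose getline() and MAP_ANONYMOUS in strict -std= modes */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/mman.h>

/* change size to suit your requirements */
#define BUFSIZE ( 16UL * 1024UL * 1024UL )

/* Hypothetical helper: counts the lines in path, reading through a
   BUFSIZE stdio buffer. Returns the line count, or -1 on error. */
long count_lines( const char *path )
{
    FILE *fp = fopen( path, "rb" );
    if ( fp == NULL )
    {
        perror( "fopen" );
        return -1;
    }

    char *buffer = mmap( NULL, BUFSIZE, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0 );
    if ( buffer == MAP_FAILED )
    {
        perror( "mmap" );
        fclose( fp );
        return -1;
    }

    /* setvbuf() must be called before any other operation on the stream */
    if ( setvbuf( fp, buffer, _IOFBF, BUFSIZE ) != 0 )
    {
        fprintf( stderr, "setvbuf failed\n" );
        fclose( fp );
        munmap( buffer, BUFSIZE );
        return -1;
    }

    char *line = NULL;
    size_t len = 0;
    long count = 0;

    while ( getline( &line, &len, fp ) >= 0 )
    {
        count++;   /* process line here */
    }

    /* getline() returns -1 on both EOF and error; ferror() tells them apart */
    int failed = ferror( fp );

    free( line );
    fclose( fp );                /* close the stream before releasing its buffer */
    munmap( buffer, BUFSIZE );

    return failed ? -1 : count;
}
```

Calling count_lines( path ) from your own main() returns the number of lines read, or -1 if the file could not be opened or read. Replace the count++ line with whatever per-line processing you need.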
And it might be even faster if you bypass the page cache. The above code already uses a 16 MB buffer. Additional caching just adds another copy in the data path from disk to application. Since you don't need to seek, and you're not going to re-read data, the page cache does you no good with this usage pattern. On Linux, if your file system supports direct IO, you can do this:
// O_DIRECT is Linux-specific: #define _GNU_SOURCE before including <fcntl.h>
int fd = open( path, O_RDONLY | O_DIRECT );
FILE *fp = fdopen( fd, "rb" );
Note that direct IO has significant restrictions: your IO buffer may have to be page-aligned, and read sizes and file offsets typically have to be multiples of the block size. One nice thing about mmap() is that it returns page-aligned memory...
If the filesystem supports direct IO, that will bypass the page cache, and your read operations could be substantially faster and might put a lot less memory pressure on your machine, especially if the file is extremely large.
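Since not every file system accepts O_DIRECT, it may be worth falling back to ordinary cached IO when the direct open is refused. The following is a rough sketch of that idea, assuming Linux with glibc; the helper name open_maybe_direct() and its direct out-parameter are made up for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* O_DIRECT is Linux-specific */
#endif
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: open path for reading, trying direct IO first and
   falling back to ordinary cached IO if the O_DIRECT open fails.
   *direct is set to 1 if direct IO is in use, else 0.
   Returns NULL if the file cannot be opened at all. */
FILE *open_maybe_direct( const char *path, int *direct )
{
    *direct = 1;
    int fd = open( path, O_RDONLY | O_DIRECT );
    if ( fd < 0 )
    {
        /* direct IO refused (or the file is missing): retry without it */
        *direct = 0;
        fd = open( path, O_RDONLY );
        if ( fd < 0 )
        {
            return NULL;
        }
    }

    FILE *fp = fdopen( fd, "rb" );
    if ( fp == NULL )
    {
        close( fd );   /* fdopen() failed, so the descriptor is still ours */
    }
    return fp;
}
```

The returned stream still needs the page-aligned buffer installed with setvbuf() before the first read, exactly as in the earlier code. With direct IO an unaligned default stdio buffer may cause reads to fail.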