
I want to read and process a large text file efficiently. According to many posts on SO, it's advised to read a large chunk of the file into memory and do the processing in memory.

I'm trying to do it this way:

FILE* fp = fopen(path, "r");
char chunk[SIZE];
size_t got;
// fread(chunk, 1, SIZE, fp) returns the number of bytes read, so a short
// final chunk isn't silently dropped (fread(chunk, SIZE, 1, fp) returns 0
// unless all SIZE bytes were read)
while ((got = fread(chunk, 1, SIZE, fp)) > 0) {
  process_chunk_line_by_line(chunk, got);
}

The problem is: what if the last line in the chunk is incomplete? I should not process that last line, so should I seek fp back by the length of the incomplete last line? Is there a more efficient way to do this?
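
For concreteness, here's an untested sketch of the seek-back idea (process_chunk_line_by_line() is the same placeholder as above, here taking a byte count):

#include <stdio.h>

void process_chunk_line_by_line(const char *buf, size_t len); // placeholder

void read_chunks_seekback(FILE *fp, char *chunk, size_t size)
{
    size_t got;
    while ((got = fread(chunk, 1, size, fp)) > 0) {
        // find the end of the last complete line in this chunk
        size_t complete = got;
        while (complete > 0 && chunk[complete - 1] != '\n')
            complete--;
        if (complete == 0 || feof(fp)) {
            // no newline at all (a line longer than the chunk), or the
            // final partial line at EOF: process everything we have
            process_chunk_line_by_line(chunk, got);
        } else {
            process_chunk_line_by_line(chunk, complete);
            // seek back over the incomplete tail so the next fread()
            // re-reads it
            fseek(fp, -(long)(got - complete), SEEK_CUR);
        }
    }
}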

avocado
  • Slightly more efficient would be to copy the incomplete line back to the beginning of the buffer, and then read into the space after the partial line directly. `seek`ing would likely lead to additional system calls that keeping the file pointer in place would avoid. Side-note: `stdio` already buffers, so odds are, you're already doing chunked reads. Increasing the size of the `stdio` buffers might get you most of the benefit without needing to chunk yourself. (See the sketch after these comments.) – ShadowRanger Oct 20 '16 at 01:46
  • @ShadowRanger, but I have to check the incompleteness of the last line anyway, don't I? – avocado Oct 20 '16 at 09:50
  • @ShadowRanger, Andrew's answer clears my doubts, and now I understand what you mean by "increasing the size of stdio buffers", thanks – avocado Oct 21 '16 at 02:17
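
ShadowRanger's copy-back idea might look roughly like this (untested sketch, reusing the hypothetical process_chunk_line_by_line() from the question):

#include <stdio.h>
#include <string.h>

void process_chunk_line_by_line(const char *buf, size_t len); // placeholder

void read_chunks_copyback(FILE *fp, char *buf, size_t size)
{
    size_t carry = 0; // bytes of a partial line carried over from the last read
    size_t got;
    while ((got = fread(buf + carry, 1, size - carry, fp)) > 0) {
        size_t have = carry + got;
        // find the end of the last complete line
        size_t complete = have;
        while (complete > 0 && buf[complete - 1] != '\n')
            complete--;
        if (complete == 0)
            complete = have; // a line longer than the buffer: punt
        process_chunk_line_by_line(buf, complete);
        // move the incomplete tail to the front instead of seeking back
        carry = have - complete;
        memmove(buf, buf + complete, carry);
    }
    if (carry > 0)
        process_chunk_line_by_line(buf, carry); // final line lacking '\n'
}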

1 Answer


This is the easiest way I know to do what you want using existing stdio library functions. (stdio, since that's what you're already using and seem to be familiar with; there are other ways to do this with C++ streams.)

stdio streams opened with fopen() already buffer input, and your OS likely maintains a page cache as well. Adding another layer of buffering in your application means there would be three layers of buffering between the data on disk and your processing: 1) the page cache, 2) the stdio buffer, and 3) your chunk. As @ShadowRanger commented - just use a bigger stdio buffer, and then you can use the POSIX getline() function to read lines.

// for getline() and MAP_ANONYMOUS on glibc
#define _GNU_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

// change size to suit your requirements
#define BUFSIZE ( 16UL * 1024UL * 1024UL )

FILE *fp = fopen( path, "rb" );

// assuming a POSIX OS - could also use malloc()/free()
char *buffer = ( char * ) mmap( NULL, BUFSIZE, PROT_READ | PROT_WRITE,
    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0 );

// replace the default stdio buffer (typically just a few KB) with the
// 16 MB one - this must happen before any IO is done on the stream
setvbuf( fp, buffer, _IOFBF, BUFSIZE );

char *line = NULL;
size_t len = 0;

for ( ;; )
{
    // getline() grows 'line' as needed and returns the line length,
    // or -1 on EOF or error
    ssize_t currentLen = getline( &line, &len, fp );
    if ( currentLen < 0 )
    {
        break;
    }

    // process line
}

free( line );
fclose( fp );
munmap( buffer, BUFSIZE );

You'll still need to add error checking - fopen(), mmap(), setvbuf(), and getline() can all fail.

That should do exactly what you want - you don't have to write code that figures out where lines end, nor deal with lines that span multiple fread() calls.

And it might be even faster if you bypass the page cache. The above code already uses a 16 MB buffer; additional caching just adds another copy on the data path from disk to application. Since you don't need to seek, and you're not going to re-read data, the page cache does you no good in this usage pattern. On Linux, if your file system supports direct IO, you can do this:

// O_DIRECT requires _GNU_SOURCE and <fcntl.h>
int fd = open( path, O_RDONLY | O_DIRECT );
FILE *fp = fdopen( fd, "rb" );

Note that direct IO has significant restrictions - your IO buffer may have to be page-aligned. One nice thing about mmap() is that it returns page-aligned memory...

If the filesystem supports direct IO, that will bypass the page cache, and your read operations could be substantially faster and might put a lot less memory pressure on your machine, especially if the file is extremely large.
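
If you'd rather not use mmap(), posix_memalign() is another way to get page-aligned memory for the stdio buffer - a sketch, reusing BUFSIZE and fp from the code above, error checking again omitted:

#include <stdlib.h>
#include <unistd.h>

// page-aligned allocation without mmap() - still usable for direct IO
long pageSize = sysconf( _SC_PAGESIZE );
char *buffer = NULL;
if ( posix_memalign( ( void ** ) &buffer, ( size_t ) pageSize, BUFSIZE ) )
{
    // allocation failed
}
setvbuf( fp, buffer, _IOFBF, BUFSIZE );

// ... and call free( buffer ) after fclose( fp ) instead of munmap()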

Andrew Henle
  • According to this one: http://stackoverflow.com/q/17925051/2235936, `getline` doesn't look like a good way to go though. – avocado Oct 20 '16 at 23:49
  • Oh, I just found out what you meant by `setvbuf` and `mmap` (which I overlooked at first glance): you actually read a large chunk of the file and buffer it in an array mapped with `mmap`, and then `getline` operates on that in-memory buffer, right? That's clever :-) – avocado Oct 21 '16 at 00:02
  • And, I could remove the `mmap` part if I don't care about *page-aligned memory*, right? I just don't quite get the advantage of `mmap` here. – avocado Oct 21 '16 at 00:07
  • @loganecolss - The default `stdio` buffer size is only a few kilobytes. So you need to get a larger memory buffer from somewhere - whether it's from `mmap()` or `malloc()` or `new` or `posix_memalign()`. If you're running on Linux, larger allocations fall back to `mmap()` anyway. The page-aligned memory is usually necessary for direct IO, which bypasses the page cache. Just reading a large file like this - stream it from start to end, without re-reading or changing any part of it - is a perfect use of direct IO since the page cache provides no benefits when you never re-read or update data. – Andrew Henle Oct 21 '16 at 09:52
  • @loganecolss "According to this one: stackoverflow.com/q/17925051/2235936, `getline` doesn't look like a good way to go though." That's the C++ `getline` that operates on a C++ `ifstream`. Note also that the fastest implementation on that question uses low-level `open()`/`read()` operations - and that could possibly go even faster by using direct IO, and a larger page-aligned buffer. But direct IO isn't supported by all file systems, and on Linux it can be very particular - such as allowing *only* page-sized IO requests, which makes reading/writing the last few bytes of a file impossible. – Andrew Henle Oct 21 '16 at 09:59
  • I see, so `mmap` would be beneficial overall. – avocado Oct 22 '16 at 01:42