How do I process (in read-only fashion) a big binary file in C/C++ on Linux as fast as possible: via read or mmap? What buffer size should I use? (No Boost or anything.)

Cartesius00

2 Answers

mmap is faster and optimal for read-only applications. See the answer here:

https://stackoverflow.com/a/258097/1094175

Brett McLain
  • That answer is mostly accurate, but the bit about "mmap allows all those processes to share the same physical memory pages, saving a lot of memory" is bunk -- the filesystem cache does this. – Brian Cain Dec 20 '11 at 19:31

You could use madvise with mmap, and you might also call readahead (perhaps in a separate thread, since it is a blocking syscall).
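
A minimal sketch of that combination (the file name big.bin, the 64 MB window, and the bare-bones error handling are illustrative assumptions, not part of the original answer):

```c
#define _GNU_SOURCE            /* for readahead(2) */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("big.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    /* Map (at most) the first 64 MB of the file, read-only.
       Assumes a non-empty file. */
    size_t len = st.st_size < (64 << 20) ? (size_t)st.st_size
                                         : (size_t)(64 << 20);
    void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Tell the kernel we will scan this mapping sequentially. */
    madvise(p, len, MADV_SEQUENTIAL);

    /* Start pulling the region into the page cache; readahead(2) blocks,
       so you might call it from a separate thread instead. */
    readahead(fd, 0, len);

    /* ... process the len bytes at p ... */

    munmap(p, len);
    close(fd);
    return 0;
}
```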

If you read the file using ordinary read(2), consider calling posix_fadvise(2) and passing buffers of 32 KB to 1 MB to read(2).
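
For comparison, a plain read(2) loop with a sequential-access hint might look like the sketch below; the 64 KB buffer and the file name are again just placeholders:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("big.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* Hint that we will read sequentially; len == 0 means "to end of file". */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    char buf[64 * 1024];
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0) {
        /* ... process n bytes in buf ... */
    }
    if (n < 0) perror("read");

    close(fd);
    return 0;
}
```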

Call mmap on big enough regions: at least several dozen megabytes (assuming you have more than 1 GB of RAM), and if you have a lot of available RAM, use bigger regions (up to perhaps 80% of available RAM).
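
One way to follow that advice is to walk the file window by window, mapping and unmapping one large region at a time. The 256 MB window below is an arbitrary illustrative value, and on 32-bit systems you would also want -D_FILE_OFFSET_BITS=64:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("big.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    const off_t window = 256LL * 1024 * 1024;   /* 256 MB per mapping */
    for (off_t off = 0; off < st.st_size; off += window) {
        size_t len = (size_t)(st.st_size - off < window
                              ? st.st_size - off : window);
        /* The offset must be page-aligned; multiples of 256 MB always are. */
        void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, off);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* ... scan the len bytes at p ... */

        munmap(p, len);
    }
    close(fd);
    return 0;
}
```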

Take care of resource limits, e.g. those set with setrlimit(2).
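
For example, you can query the current address-space limit before deciding how much to map (a minimal sketch):

```c
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_AS, &rl) != 0) { perror("getrlimit"); return 1; }

    if (rl.rlim_cur == RLIM_INFINITY)
        puts("address space: unlimited");
    else
        printf("address space limit: %llu bytes\n",
               (unsigned long long)rl.rlim_cur);
    return 0;
}
```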

For files that are not too big (and not too many of them), you could mmap them entirely. You'll need to call e.g. stat(2) to get their size. As a rule of thumb, when reading one (not several) big file on my desktop machine, I would mmap it in full if it is less than 3 GB.
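
A sketch of that rule of thumb, with the 3 GB cut-off and the file name hard-coded purely for illustration:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("big.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    if (st.st_size > 0 && st.st_size <= 3LL * 1024 * 1024 * 1024) {
        /* Small enough: map the whole file in one go. */
        void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        /* ... process st.st_size bytes at p ... */
        munmap(p, (size_t)st.st_size);
    } else {
        /* Fall back to windowed mapping or read(2), as sketched above. */
    }
    close(fd);
    return 0;
}
```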

If performance is important, take time to benchmark your application and your system, and to tune it accordingly. Making the parameters (like the mmap-ed region size) configurable makes sense.

The /proc/ filesystem, notably /proc/self/ from within your application, gives several useful measures (e.g. /proc/self/status, /proc/self/maps, /proc/self/smaps, /proc/self/statm, etc.).
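
Those files are plain text, so your program can read them like any other file; for instance, dumping /proc/self/statm (memory usage counted in pages):

```c
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/self/statm", "r");
    if (!f) { perror("fopen"); return 1; }

    /* Copy the single line of counters to stdout. */
    char line[256];
    while (fgets(line, sizeof line, f))
        fputs(line, stdout);

    fclose(f);
    return 0;
}
```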

GNU libc should use mmap for reading FILEs that you have fopen-ed with the "rm" mode.
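
So a stdio-based reader can simply opt in through the mode string; the trailing m is a glibc extension, and the file name below is a placeholder:

```c
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("big.bin", "rm");   /* 'm': glibc may back this FILE with mmap */
    if (!f) { perror("fopen"); return 1; }

    char buf[64 * 1024];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0) {
        /* ... process n bytes in buf ... */
    }
    fclose(f);
    return 0;
}
```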

Basile Starynkevitch