
Memory-mapping a file saves a data copy and is therefore faster for large files.
Reading a file avoids manipulating the MMU and is therefore faster for small files.
When reading a large number of files, choosing the best method per file may make a difference.
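For illustration, here is a minimal sketch of the two code paths I am choosing between; the `MMAP_THRESHOLD` value and the `consume()` callback are placeholders I made up, not a recommendation:

```c
/* Sketch of the per-file decision: read() for small files, mmap() for large
 * ones.  MMAP_THRESHOLD is a placeholder, and consume() stands in for
 * whatever per-file processing the real program does. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define MMAP_THRESHOLD (128 * 1024)   /* placeholder cut-over size */

static void consume(const char *data, size_t len)
{
    (void)data; (void)len;            /* stand-in for the actual work */
}

static int process_file(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror(path); return -1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); close(fd); return -1; }

    if ((size_t)st.st_size >= MMAP_THRESHOLD) {
        /* large file: map it and let the kernel fault pages in as we scan */
        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); close(fd); return -1; }
        madvise(p, st.st_size, MADV_SEQUENTIAL);   /* hint: linear scan */
        consume(p, st.st_size);
        munmap(p, st.st_size);
    } else {
        /* small file: a plain read avoids the page-table setup/teardown */
        char buf[64 * 1024];
        ssize_t n;
        while ((n = read(fd, buf, sizeof buf)) > 0)
            consume(buf, (size_t)n);
    }
    close(fd);
    return 0;
}
```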

Do I need to hard-code the file size limit to make this decision, or is there a "best practice" (heuristic) algorithm to infer the decision file size from some system variables at run time when running on Linux on Intel?
The files in question will be read linearly (no random access) and are of very different sizes.

Edit: I'm not willing to implement some benchmark algorithm because the difference between mmap and read is small and does not justify such an overhead, even if there is a large number of files to be processed.

2nd edit: This is a general question about good coding habits and not tied to some particular set of files on some particular machine.
Imagine I wanted to improve the performance of grep (which is not actually the case):
How would one efficiently implement linear reads of many previously unknown files?

Juergen
  • Does this answer your question? [mmap() vs. reading blocks](https://stackoverflow.com/questions/45972/mmap-vs-reading-blocks) – stark Feb 24 '23 at 12:24
  • @stark, no it doesn't because that question is not about determining the decision file size. And it is not about linear reading one-time only. – Juergen Feb 24 '23 at 13:38
  • "*I'm not willing to implement some benchmark algorithm because the difference between mmap and read is small and does not justify such an overhead, even if there is a large number of files to be processed.*" - that makes no sense. Of course you will see the difference after benchmarking. And if not, then you'll know it does not actually matter. Just make sure to run through the files multiple times to reduce the effects of caching/disk access. – rustyx Feb 24 '23 at 14:57
  • @rustyx, when writing a general-purpose program that is supposed to run on different machines (large or small memory, server or laptop, VM or real hardware, different numbers of cores, old or new architectures ...) it might be reasonable to use a different decision file size when switching between them. On the other hand, it is not reasonable to include a benchmark in a general-purpose program. – Juergen Feb 24 '23 at 15:57

1 Answer


You are requesting a kind of learning algorithm; the approach would be to start from some initial value, e.g. a hard-coded size, and then measure e.g. the time required to read a fixed amount of data.

The system cannot deliver such information, as it cannot guess what your code is going to do before you even open a file. It cannot know that you are about to just read sequentially.

You'll have to code some measuring algorithm and feed the result back into your decision to either memory-map or read certain files. The results could be persisted somehow so you don't have to start from scratch each time, but these measurements are of course machine-dependent.
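A rough sketch of such a measuring step, just for illustration (the single-file calibration and the wall-clock timing are assumptions on my part, and page-cache effects are ignored here):

```c
/* Time one sequential pass with read() and one with mmap() over the same
 * sample file, and remember which was faster for files of similar size. */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

/* Returns 1 if the mmap pass over `path` was faster than the read pass. */
static int mmap_wins(const char *path)
{
    volatile unsigned long sum = 0;   /* keep the scans from being optimised away */
    struct stat st;
    char buf[64 * 1024];

    int fd = open(path, O_RDONLY);
    if (fd < 0 || fstat(fd, &st) < 0 || st.st_size == 0) {
        if (fd >= 0) close(fd);
        return 0;
    }

    /* pass 1: read() */
    double t0 = now_sec();
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        for (ssize_t i = 0; i < n; i++) sum += (unsigned char)buf[i];
    double t_read = now_sec() - t0;

    /* pass 2: mmap(); on failure, fall back to "read wins" */
    double t1 = now_sec();
    double t_mmap = t_read;
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p != MAP_FAILED) {
        for (off_t i = 0; i < st.st_size; i++) sum += (unsigned char)p[i];
        t_mmap = now_sec() - t1;
        munmap(p, st.st_size);
    }
    close(fd);
    return t_mmap < t_read;
}
```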

Synopsis
  • I'm not willing to implement a benchmark because the difference between `mmap` and `read` is small and does not justify such an overhead, even if there is a large number of files to be processed. What I want is either a hard limit (maybe 128 kB) or some easy algorithm like "if the file is larger than 1 % of the system free memory, use mmap" or "use mmap if the file is larger than the page size times the number of CPUs" or whatever. – Juergen Feb 24 '23 at 13:55
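A minimal sketch of the second kind of heuristic mentioned in that comment (page size times number of CPUs, queried from `sysconf` at run time); the formula is only the proposal quoted above, not a validated rule, and the fallback value is arbitrary:

```c
/* Derive a file-size threshold from system variables instead of a benchmark. */
#include <stdio.h>
#include <unistd.h>

/* Returns a file-size threshold above which mmap would be preferred. */
static long mmap_threshold(void)
{
    long page = sysconf(_SC_PAGESIZE);        /* typically 4096 on x86 Linux */
    long cpus = sysconf(_SC_NPROCESSORS_ONLN);
    if (page < 0 || cpus < 1)
        return 128 * 1024;                    /* arbitrary fallback */
    return page * cpus;
}

int main(void)
{
    printf("use mmap for files larger than %ld bytes\n", mmap_threshold());
    return 0;
}
```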