Choosing when to mmap vs read a file in a compiler

Question

I know many similar questions have already been asked here, but mine is somewhat more nuanced than what I could find on the web. I'm working on a toy C compiler just for fun, and I want to see what kind of performance I can get out of it by focusing on a single version of the core language standard, without the many optional features provided by e.g. clang or gcc. To that end, I want my lexer to read and process files from disk efficiently. Given how many header files the average source file includes (especially counting recursive includes) and how many source files the average program has, efficient file reading will be very important.

The two systems I want to target are Linux and macOS. Both provide two main ways of doing file I/O: open and read calls (buffered or unbuffered) that read a file as a stream, and mmap, which maps the file transparently into the process's virtual address space.

Most askers of similar questions seem to have had very different use cases: either they are dealing with a small number of very large files, or file I/O is not truly a major bottleneck in their application. To be fair, I may be naive here: I haven't completed the program yet, so I may end up in the second group, but I wanted to at least see what others think. For instance, I know LLVM will use mmap to read files if they are over the current page size or 16KB, and will use open/read calls if they are not.

The question is then: which of these methods is best when dealing with a large number of files of varying sizes? The goal is to read the files into memory and parse them character by character multiple times (preprocessing, then the main C language processing). Is there a good threshold above which files should be mapped rather than buffered on the heap? Should I just use one of these approaches over the other in all cases? My goal is mainly on speed: I don't want to have to bottleneck on file I/O when I could be parsing code instead.
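To make the threshold idea concrete, here is a minimal sketch of a loader that maps files at or above a cutoff and reads smaller ones into a heap buffer. The names FileView, file_view_open, file_view_close and the mmap_threshold parameter are hypothetical, and the right cutoff is exactly what would need benchmarking; nothing here is taken from LLVM.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical read-only view of a whole file. */
typedef struct {
    const char *data;   /* file bytes; NOT NUL-terminated */
    size_t      size;   /* number of valid bytes in data */
    int         mapped; /* nonzero if data came from mmap */
} FileView;

/* Open `path`, using mmap when the file is at least `mmap_threshold`
   bytes and malloc + read otherwise. Returns 0 on success, -1 on error. */
int file_view_open(const char *path, size_t mmap_threshold, FileView *out)
{
    assert(path != NULL && out != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) != 0 || !S_ISREG(st.st_mode)) { close(fd); return -1; }
    size_t size = (size_t)st.st_size;

    /* Empty files always take the read path, so mmap never sees length 0. */
    if (size > 0 && size >= mmap_threshold) {
        void *p = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);                      /* the mapping outlives the descriptor */
        if (p == MAP_FAILED) return -1;
        out->data = p; out->size = size; out->mapped = 1;
        return 0;
    }

    char *buf = malloc(size ? size : 1);
    if (buf == NULL) { close(fd); return -1; }
    size_t done = 0;
    while (done < size) {
        ssize_t n = read(fd, buf + done, size - done);
        if (n < 0 && errno == EINTR) continue;  /* interrupted: retry */
        if (n <= 0) break;                      /* error or unexpected EOF */
        done += (size_t)n;
    }
    close(fd);
    if (done != size) { free(buf); return -1; }
    out->data = buf; out->size = size; out->mapped = 0;
    return 0;
}

/* Release a view with whichever mechanism created it. */
void file_view_close(FileView *v)
{
    if (v->mapped) munmap((void *)v->data, v->size);
    else free((void *)v->data);
    v->data = NULL; v->size = 0; v->mapped = 0;
}
```

One design consequence worth noting: a mapped file has no guaranteed terminating NUL byte, so a lexer built on this sketch would have to bound its scanning by size rather than rely on a sentinel character.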

Write a function which hides this detail, just gives you a file, and make one implementation. Then once your compiler works, get back to benchmarking this. — hyde, Mar 14 '22 at 06:52
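A minimal sketch of what such a function could look like, using plain open/read as the single implementation. SourceBuffer, source_load and source_free are hypothetical names chosen for illustration; the point is only that callers never see which I/O mechanism sits behind them.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical type: one source file held entirely in memory. */
typedef struct {
    char  *data;  /* file contents, NUL-terminated for the lexer */
    size_t size;  /* number of bytes, not counting the terminator */
} SourceBuffer;

/* Load the whole file at `path`. Returns 0 on success, -1 on failure.
   An mmap-based version could replace this body later without
   changing any caller. */
int source_load(const char *path, SourceBuffer *out)
{
    assert(path != NULL && out != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) != 0 || !S_ISREG(st.st_mode)) { close(fd); return -1; }

    size_t size = (size_t)st.st_size;
    char *data = malloc(size + 1);          /* +1 for the terminator */
    if (data == NULL) { close(fd); return -1; }

    size_t done = 0;
    while (done < size) {
        ssize_t n = read(fd, data + done, size - done);
        if (n < 0 && errno == EINTR) continue;            /* retry */
        if (n < 0) { free(data); close(fd); return -1; }  /* real error */
        if (n == 0) break;                  /* file shrank while reading */
        done += (size_t)n;
    }
    close(fd);

    data[done] = '\0';
    out->data = data;
    out->size = done;
    return 0;
}

/* Free a buffer returned by source_load. */
void source_free(SourceBuffer *buf)
{
    free(buf->data);
    buf->data = NULL;
    buf->size = 0;
}
```

With this in place the lexer only ever calls source_load and source_free, so swapping in a different strategy after benchmarking touches one file.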

score 1 · Answer 1 · answered Mar 14 '22 at 06:11

My goal is mainly on speed: I don't want to have to bottleneck on file I/O when I could be parsing code instead.

Both reading and mmapping a file should perform the same amount of I/O -- the kernel will have to read the data from disk into memory either way.

If you have many files smaller than the page size, using mmap will waste a lot of memory, since each mapping occupies at least one full page. This may not matter on a 64-bit machine, but you could run out of VM space if your compiler is built in 32-bit mode.
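To put a rough number on that per-file overhead, the sketch below rounds a file size up to whole pages. The helper names mapped_footprint, mapping_slack and system_page_size are hypothetical; the page size itself comes from sysconf and differs between systems (commonly 4096 bytes on x86-64 Linux, 16384 on Apple Silicon macOS).

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Bytes of address space a mapping of `size` bytes occupies when
   mappings are made in units of `page` bytes. */
size_t mapped_footprint(size_t size, size_t page)
{
    assert(page > 0);
    return (size + page - 1) / page * page;   /* round up to a page multiple */
}

/* Bytes of the mapping that lie beyond the end of the file contents. */
size_t mapping_slack(size_t size, size_t page)
{
    return mapped_footprint(size, page) - size;
}

/* Page size reported by the running system, with a fallback guess
   of 4096 if sysconf fails (the fallback is an assumption). */
size_t system_page_size(void)
{
    long p = sysconf(_SC_PAGESIZE);
    return p > 0 ? (size_t)p : 4096;
}
```

For example, with 4096-byte pages a 300-byte header occupies 4096 bytes when mapped, so mapping_slack(300, 4096) is 3796. Across a few thousand small headers that adds up to several megabytes of address space, which is the concern for a 32-bit build.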

If you are going to parse the same files repeatedly (which is an unusual thing to do in a compiler), you may be better off with mmap.

You could also get drastically different performance results depending on how much memory your machine has, whether it has an SSD or a spinning disk, etc.

TL;DR: you are unlikely to get a definitive answer -- there are too many variables for one.
