I know many similar questions have already been asked here, but mine is somewhat more nuanced than what I could find on the web. I'm currently working on a toy C compiler just for fun, but I want to see what kind of performance I can get out of it by focusing on a single version of the core language standard, without the many optional features provided by, e.g., clang or gcc. As such, I want to read and process files from disk efficiently for my lexer. Given how many header files the average source file includes (especially counting recursive includes) and the number of source files in the average program, efficient file reading will be very important.
The two systems I want to target are Linux and macOS. Both provide two main ways of doing file I/O: `open` and `read` calls (buffered or unbuffered) to read a file as a stream, and `mmap` to map the file transparently into virtual memory.
Most askers of similar questions seem to have had very different use cases: either they were dealing with a small number of very large files, or with applications where file I/O is not truly a major bottleneck. To be fair, I may be naive here: I haven't completed the program yet, so I may end up in the second group, but I wanted to at least see what others think. For instance, I know LLVM will use `mmap` to read files if they are over the current page size or 16 KB, and will use `open`/`read` calls if they are not.
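For concreteness, here is a rough sketch of the kind of size-threshold loader I have in mind. It is not LLVM's code, and the names `SourceBuf`, `source_load`, `source_free`, and `MAP_THRESHOLD` are hypothetical ones I made up for illustration. The 16 KB cutoff is only a placeholder that would have to be tuned by measurement.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical cutoff: 16 KB is a placeholder, not a measured optimum. */
#define MAP_THRESHOLD ((size_t)16 * 1024)

/* Hypothetical type: a loaded source file plus how it was obtained. */
typedef struct {
    const char *data;
    size_t size;
    int mapped; /* 1 if data came from mmap, 0 if from malloc */
} SourceBuf;

/* Loads a regular file; returns 0 on success, -1 on failure. */
static int source_load(const char *path, SourceBuf *out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) < 0 || !S_ISREG(st.st_mode)) {
        close(fd);
        return -1;
    }
    size_t size = (size_t)st.st_size;

    if (size >= MAP_THRESHOLD) {
        /* Large file: map it read-only. The mapping stays valid after
         * close(). The data is not guaranteed to be NUL-terminated, so
         * the lexer has to rely on `size`. */
        void *p = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);
        if (p == MAP_FAILED) return -1;
        out->data = p;
        out->size = size;
        out->mapped = 1;
        return 0;
    }

    /* Small file: read it into a heap buffer with a trailing NUL. */
    char *buf = malloc(size + 1);
    if (!buf) {
        close(fd);
        return -1;
    }
    size_t got = 0;
    while (got < size) {
        ssize_t n = read(fd, buf + got, size - got);
        if (n < 0) {
            if (errno == EINTR) continue; /* interrupted, retry */
            free(buf);
            close(fd);
            return -1;
        }
        if (n == 0) break; /* file shrank while reading */
        got += (size_t)n;
    }
    close(fd);
    buf[got] = '\0';
    out->data = buf;
    out->size = got;
    out->mapped = 0;
    return 0;
}

/* Releases a buffer using whichever mechanism produced it. */
static void source_free(SourceBuf *b) {
    if (b->mapped) munmap((void *)b->data, b->size);
    else free((void *)b->data);
    b->data = NULL;
    b->size = 0;
}
```

With something like this, the lexer only ever sees a pointer and a length, so the cutoff (or the choice to always use one path) could be changed later without touching the parsing code.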
The question, then, is which of these methods is best when dealing with a large number of files of varying sizes. The goal is to read the files into memory and parse them character by character multiple times (preprocessing and main C language processing). Is there a good threshold above which files should be mapped rather than read into a heap buffer? Should I just use one approach in all cases? My main goal is speed: I don't want to bottleneck on file I/O when I could be parsing code instead.