
I've already asked this question on cs.stackexchange.com, but decided to post it here as well.

I've read several blogs and questions on stack exchange, but I'm unable to grasp what the real drawbacks of memory mapped files are. I see the following are frequently listed:

  1. You can't memory map large files (>4GB) with a 32-bit address space. This makes sense to me now.

  2. One drawback that I thought of was that if too many files are memory mapped, this reduces available system memory => can cause pages to be evicted => potentially more page faults. So some prudence is required in deciding which files to memory map and what their access patterns are.

  3. Overhead of kernel mappings and data structures - according to Linus Torvalds. I won't even attempt to question this premise, because I don't know much about the internals of the Linux kernel. :)

  4. If the application is trying to read from a part of the file that is not loaded in the page cache, it (the application) will incur a penalty in the form of a page-fault, which in turn means increased I/O latency for the operation.

QUESTION #1: Isn't this the case for a standard file I/O operation as well? If an application tries to read from a part of a file that is not yet cached, it will result in a syscall that will cause the kernel to load the relevant page/block from the device. And on top of that, the page needs to be copied back to the user-space buffer.

Is the concern here that page-faults are somehow more expensive than syscalls in general - my interpretation of what Linus Torvalds says here? Is it because page-faults are blocking => the thread is not scheduled off the CPU => we are wasting precious time? Or is there something I'm missing here?

  5. No support for async I/O for memory mapped files.

QUESTION #2: Is there an architectural limitation with supporting async I/O for memory mapped files, or is it just that no one got around to doing it?

QUESTION #3: Vaguely related, but my interpretation of this article is that the kernel can read-ahead for standard I/O (even without fadvise()) but does not read-ahead for memory mapped files (unless issued an advisory with madvise()). Is this accurate? If this statement is in fact true, is that why syscalls for standard I/O may be faster, as opposed to a memory mapped file which will almost always cause a page-fault?


1 Answer


QUESTION #1: Isn't this the case for a standard file I/O operation as well? If an application tries to read from a part of a file that is not yet cached, it will result in a syscall that will cause the kernel to load the relevant page/block from the device. And on top of that, the page needs to be copied back to the user-space buffer.

You do the read into a buffer and the I/O device copies the data there. There are also async reads, or AIO, where the kernel transfers the data in the background as the device provides it. You can achieve the same thing with threads and read. For the mmap case, you have no control over, and no way of knowing, whether a page is mapped or not. The case with read is more explicit. This follows from,

ssize_t read(int fd, void *buf, size_t count);

You specify a buf and a count, so you can explicitly place the data where you want it in your program. As a programmer, you may know that the data will not be used again, and subsequent calls to read can then reuse the same buf as the last call. This has multiple benefits; the easiest to see is less memory use (or at least less address space and fewer MMU table entries). mmap cannot know whether a page is still going to be accessed in the future, and it cannot know that only some of the data in the page was of interest. Hence, read is more explicit.
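A rough sketch of that reuse pattern (the file name and chunk size here are arbitrary placeholders, not anything from the question):

```c
/* Minimal sketch of reusing one buffer across read() calls.
 * The file name and chunk size are illustrative assumptions. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];               /* one buffer, reused for every chunk */
    int fd = open("data.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0) {
        /* process buf[0..n-1]; the next read() overwrites the same pages,
         * so the working set stays small and cache/TLB friendly */
    }
    if (n < 0) perror("read");
    close(fd);
    return 0;
}
```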

Imagine you have 4096 records of 4095 bytes each on a disk. You need to read/look at two random records and perform an operation on them. For read, you can allocate two 4095-byte buffers with malloc() or use static char buffer[2][4095] data. With mmap(), because the records are not page-aligned, each record straddles two pages, so on average 8192 bytes must be mapped per record, or 16k total. Accessing each mmap'd record therefore results in two page faults, and the kernel must set up four TLB/MMU page entries to hold the data.

Alternatively, if you read into two sequential buffers, only two pages are needed, with only two syscalls (read). Also, if the computation on the records is extensive, the locality of the buffers will make it much faster (CPU cache hits) than the mmap'd data.
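A sketch of that two-record scenario under the same assumptions (the file name and record numbers are made up; the point is the page-span arithmetic, not the exact code):

```c
/* Sketch of the two-random-record example: 4095-byte records read with
 * pread() into small reusable buffers vs. accessed through mmap().
 * Record numbers and the file name are illustrative assumptions. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define RECSZ 4095

int main(void)
{
    int fd = open("records.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* read() version: two small buffers, two syscalls, two pages touched */
    static char rec[2][RECSZ];
    size_t idx[2] = { 17, 3000 };              /* arbitrary record numbers */
    for (int i = 0; i < 2; i++)
        if (pread(fd, rec[i], RECSZ, (off_t)idx[i] * RECSZ) != RECSZ)
            perror("pread");

    /* mmap() version: each unaligned record usually straddles two pages,
     * so touching it can fault in two pages and use two PTE/TLB entries */
    struct stat st;
    fstat(fd, &st);
    char *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }
    for (int i = 0; i < 2; i++) {
        const char *r = map + idx[i] * RECSZ;  /* may span a page boundary */
        volatile char first = r[0], last = r[RECSZ - 1]; /* force faults */
        (void)first; (void)last;
    }

    munmap(map, st.st_size);
    close(fd);
    return 0;
}
```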

And on top of that, the page needs to be copied back to the user-space buffer.

This copy may not be as bad as you believe. The CPU will cache the data so that the next access doesn't have to reload from main memory, which can be 100x slower than the L1 CPU cache.

In the case above, mmap can take over two times as long as a read.

Is the concern here that page-faults are somehow more expensive than syscalls in general - my interpretation of what Linus Torvalds says here? Is it because page-faults are blocking => the thread is not scheduled off the CPU => we are wasting precious time? Or is there something I'm missing here?

I think the main point is that you don't have control with mmap. You mmap the file and have no idea whether any part of it is in memory or not. If you just randomly access the file, it will keep reading it back from disk, and you may get thrashing, depending on the access pattern, without knowing it. If the access is purely sequential, then read may not seem any better at first glance. However, by re-reading each new chunk into the same user buffer, the L1/L2 CPU caches and TLB will be better utilized; both for your process and for others in the system. If you read all chunks into unique buffers and process them sequentially, then they will be about the same (see the note below).
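One way to make that lack of visibility concrete is the Linux-specific mincore() call, which reports which pages of a mapping are currently resident; this is only an illustration, and the file name is an assumption:

```c
/* Illustration only: on Linux, mincore() reports which pages of a mapping
 * are resident in memory. An application normally has no such visibility
 * while simply dereferencing mmap'ed addresses. File name is an assumption. */
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    fstat(fd, &st);

    char *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    long pagesz = sysconf(_SC_PAGESIZE);
    size_t npages = (st.st_size + pagesz - 1) / pagesz;
    unsigned char *vec = malloc(npages);

    if (mincore(map, st.st_size, vec) == 0) {
        size_t resident = 0;
        for (size_t i = 0; i < npages; i++)
            resident += vec[i] & 1;              /* bit 0 = page resident */
        printf("%zu of %zu pages resident\n", resident, npages);
    }

    free(vec);
    munmap(map, st.st_size);
    close(fd);
    return 0;
}
```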

QUESTION #2: Is there an architectural limitation with supporting async I/O for memory mapped files, or is it just that no one got around to doing it?

mmap is already similar to AIO, but it has a fixed granularity of 4k. I.e., the full mmap'd file doesn't need to be in memory to start operating on it. Functionally, they are different mechanisms to get a similar effect. They are architecturally different.
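For contrast, here is a minimal POSIX AIO sketch (aio_read() from <aio.h>); the file name and sizes are assumptions, and the point is only that AIO hands the kernel an explicit buffer, offset and length rather than a mapping:

```c
/* Minimal POSIX AIO sketch: the application hands the kernel an explicit
 * buffer, offset and length, then polls for completion. File name and
 * sizes are illustrative assumptions. Link with -lrt on older glibc. */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    static char buf[4096];
    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof buf;
    cb.aio_offset = 0;

    if (aio_read(&cb) != 0) { perror("aio_read"); return 1; }

    /* do other work here, then check whether the read finished */
    while (aio_error(&cb) == EINPROGRESS)
        usleep(1000);

    ssize_t n = aio_return(&cb);
    printf("read %zd bytes asynchronously\n", n);

    close(fd);
    return 0;
}
```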

QUESTION #3: Vaguely related, but my interpretation of this article is that the kernel can read-ahead for standard I/O (even without fadvise()) but does not read-ahead for memory mapped files (unless issued an advisory with madvise()). Is this accurate? If this statement is in fact true, is that why syscalls for standard I/O may be faster, as opposed to a memory mapped file which will almost always cause a page-fault?

Poor programming of read can be just as bad as mmap. mmap can use madvise. The concern is more about all the Linux MM work that has to happen to make mmap function. It all depends on your use case; either can work better depending on the access patterns. I think that Linus was just saying that neither is a magic bullet.
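For reference, a small sketch of the advisory calls for both paths (the file name is an assumption): posix_fadvise() hints the page cache for the read path, and madvise() hints the mapping for the mmap path.

```c
/* Sketch of advisory readahead hints for both paths; the file name is an
 * assumption. posix_fadvise() hints the page cache for read(), madvise()
 * hints the mapping for mmap(). */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* standard I/O: ask for aggressive sequential readahead */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    struct stat st;
    fstat(fd, &st);
    char *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    /* mmap: equivalent hints on the mapping */
    madvise(map, st.st_size, MADV_SEQUENTIAL);   /* read ahead aggressively */
    madvise(map, st.st_size, MADV_WILLNEED);     /* start paging in now */

    munmap(map, st.st_size);
    close(fd);
    return 0;
}
```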

For instance, if you read into a buffer that takes more memory than the system has and the system swaps, which does the same sort of thing as mmap, you will be worse off. You may have a system without swap, where mmap for random read access will be fine and will allow you to manage files bigger than actual memory. Doing the same setup with read requires a lot more code, which often means more bugs, or, if you are naive, you will just get an OOM kill message.

Note: however, if the access is sequential, read is not as much code and it will probably be faster than mmap.


Additional read benefits

For some use cases, read offers the use of sockets and pipes. Also, char devices, such as ttyS0, will only work with read. This can be beneficial if you author a command line program that gets file names from the command line: if you structure it around mmap, it may be difficult to support these files.
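A small sketch of that point: read() works on any file descriptor, including a pipe, while mmap() on a pipe fd is expected to fail (typically with ENODEV).

```c
/* Illustration: read() works on any file descriptor, including pipes,
 * while mmap() on a pipe fails (typically ENODEV). */
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int p[2];
    if (pipe(p) != 0) { perror("pipe"); return 1; }

    const char msg[] = "hello";
    write(p[1], msg, sizeof msg);

    char buf[64];
    ssize_t n = read(p[0], buf, sizeof buf);   /* works fine on a pipe */
    printf("read %zd bytes from the pipe\n", n);

    void *map = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, p[0], 0);
    if (map == MAP_FAILED)
        perror("mmap on a pipe");              /* expected to fail */
    else
        munmap(map, 4096);

    close(p[0]);
    close(p[1]);
    return 0;
}
```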

  • My definition of async I/O is "application asks kernel to load/store x bytes from offset f and let me (application) know when the load is complete (or) provide a monitoring mechanism to track load/store progress". I'm having a tough time imagining why mmap'd files are different than read/write. In both mechanisms, the kernel just has to update the page cache (in-memory pages) and it's done. What does the backing store's natural page size have to do with any of it? Isn't it up to the device drivers to abstract that away? – skittish Sep 10 '19 at 00:38
  • If I async read and want to load <4KB, doesn't the kernel load 4KB from the backing store anyway because that's the unit of operation for the kernel (assuming page size of 4KB)? – skittish Sep 10 '19 at 00:41
  • Here is another [good page to examine](https://github.com/angrave/SystemProgramming/wiki/File-System,-Part-6:-Memory-mapped-files-and-Shared-memory); see **advantages of memory mapping a file** and **Difference between read + write and mmap** – artless noise Sep 10 '19 at 13:58
  • My intent was not to single out one of them as a panacea. After re-reading the answer + page you linked, I'm thinking maybe the worst case performance of mmap vs. read is not so different. Maybe it's the average case where mmap has the **POTENTIAL** to be worse than read, and that's what I'm missing? This dawned on me because the page mentions that mmap may generate minor page-faults (load TLB entries even when the page is present in physical memory). These could probably happen for read as well, but there is no way to access random offsets with read that exceed the provided buffer size. – skittish Sep 11 '19 at 19:45
  • I was trying to understand and come up with some sort of mental model for when using mmap would be beneficial vs. using read. – skittish Sep 11 '19 at 19:46
  • My take away from this conversation is 1. When an app issues a read, kernel fetches a few pages from disk, creates kernel page table entries only for the pages that were fetched, and copies data over to user provided buffers appropriately. Entries in the TLB point to user buffers, and there is no page fault (except for the major page fault that loads data for the very first time) – skittish Sep 11 '19 at 20:01
  • 2. mmap creates lots of page entries for the entire file (one for every 4K block of the file - maybe high cardinality). None of these entries get loaded into the TLB initially (because the kernel doesn't know which ones the process wants). When a process tries to access a location for the first time, a minor page fault occurs, and the kernel loads a TLB entry for that location. Every time the application accesses a new page, there is a minor page fault. There can also potentially be a major page fault if the kernel did not copy over the entire file into the kernel disk cache. – skittish Sep 11 '19 at 20:02
  • I think that is very close to my understanding. One thing is that a TLB miss can mean an L1/L2 cache miss. The I/O cache may/may not be in memory and I wouldn't count on this in the future. For instance, if SSD drives become remarkably fast. Say 1/2 to 90% of DDR speed, Linux may do away with the disk cache for that device. `read` is also more portable if you are concerned about running code on many different OSes. For large files random access reads, `mmap` is probably better. For sequential access `read` will probably be better or very similar (portability and benefits noted above). – artless noise Sep 11 '19 at 20:31
  • And a new monkey wrench is [`userfaultfd`](http://man7.org/linux/man-pages/man2/userfaultfd.2.html). Ie, roll your own `mmap`. It is used by the [umap project](https://github.com/LLNL/umap). For the **CURRENT** Linux, 4k is read into the I/O cache. I am not sure it is wise to count on this. Maybe not all OSes require disk block size to be MMU page size. (512B FAT?) Certainly it is technically feasible for many devices to return faster with a smaller `read` size than 4k, and someone may decide this is a **win** inside Linux at some point. I think the argument of having data in the I/O cache is brittle. – artless noise Sep 11 '19 at 20:48
  • Part of the metric is *I/O_speed/mem_speed >= mem_speed/cache_speed*; this depends on the system and I/O source. The 2nd is do you simply transfer the data or are you crunching the data (where it is important to be in cache). If you are simply transferring then [DMA-to-DMA](https://stackoverflow.com/questions/18343365/zero-copy-networking-vs-kernel-bypass/18346526#18346526) or [splice](http://man7.org/linux/man-pages/man2/splice.2.html) is better. – artless noise Sep 11 '19 at 20:59