Catching when the Linux kernel writes a page back to a memory-mapped file?

Question

I'm contemplating a system that would let me memory-map files and transparently do type conversion on the data they contain. It seems possible to catch memory accesses by mmapping a second memory region, protecting it, and catching the segfault raised when a new page is accessed. This would let me handle the on-read type conversion I need.

However, to be read/write compatible, I'd need some way to catch when the OS is paging part of the memory back to disk so I could do the type conversion the other way before it's written.

Is there any capability for hooking the paging system in this way?

What you want to do sounds similar to encfs, which uses FUSE to provide encryption of files. It has to be able to decrypt on the read and encrypt on the write. Maybe you could make your own FUSE filesystem using encfs as a guide. — Alan Curry, Jul 25 '12 at 22:40
Now there's a thought, I'll look into that. It basically comes down to the same thing. I just want to transform the data in a different way. (edit) Ooh but I just thought that I don't want to have to store my files in a special place to get this behavior, I want it to work for any file, like a filter. I'll still check it out though. — gct, Jul 26 '12 at 00:25
your FUSE fs could just be given a directory name on the native fs as a mount option, and pass everything through to it, letting the lower level fs talk to the block device as it usually does. Stacking! I think that's how encfs works. — Alan Curry, Jul 26 '12 at 00:36

score 3 · Accepted Answer · answered Jul 25 '12 at 18:08


What you want is not possible, and reflects a fundamental misunderstanding of mmap. The moment at which file-backed maps are written back to disk is not relevant, because until that happens, any attempt to read the file will (and must, to conform to POSIX) read from the modified in-memory copy of the page, not the outdated contents on disk. In other words, the writing back of modified pages to disk is completely transparent to applications, and assuming you never lose power or reboot, it is entirely possible that a modified page is never written back to disk.

Your design just doesn't work. You'll have to do something different if you want this kind of behavior.
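The visibility claim can be checked directly. The following is a minimal sketch, assuming Linux and a scratch file under /tmp; the function name `mapped_write_visible_to_read` and the file path are illustrative and do not come from the thread. It stores through a `MAP_SHARED` mapping and then reads the same bytes with `pread()`, with no `msync()` or `munmap()` in between.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if a store through a MAP_SHARED mapping is visible to pread()
 * on the same file without any msync()/munmap(), 0 if it is not,
 * and -1 if the setup fails. */
int mapped_write_visible_to_read(void)
{
    static const char msg[] = "hello";
    const size_t len = 4096;
    assert(sizeof msg <= len);               /* message must fit in the file */

    char path[] = "/tmp/mmap_demo_XXXXXX";   /* hypothetical scratch file */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                            /* file lives until fd is closed */

    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    memcpy(map, msg, sizeof msg);            /* plain store, no msync() */

    char buf[sizeof msg];
    ssize_t n = pread(fd, buf, sizeof buf, 0);
    int visible = (n == (ssize_t)sizeof buf && memcmp(buf, msg, sizeof msg) == 0);

    munmap(map, len);
    close(fd);
    return visible;
}
```

Calling `mapped_write_visible_to_read()` from a test program should return 1 on a conforming system, whether or not the kernel has written the page to disk yet.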


R.. GitHub STOP HELPING ICE


Sure I understand that, but _if_ the page _is_ swapped back to disk, for what I want to do to work, I have to intercept that, type convert, and then write _that_ type converted data back to disk. I'll be calling munmap when I'm done to guarantee any modified data is written back to disk. – gct Jul 25 '12 at 18:25

No, you don't get it. `munmap` does not guarantee the data is written back to disk. All it does is remove the mapping from your process's virtual address space. The concept of "stored on disk" simply *does not exist* on POSIX systems. Any write made through the `mmap`-obtained mapping is *immediately* visible to any process reading the file by any means, whether `read` or another `mmap`. – R.. GitHub STOP HELPING ICE Jul 25 '12 at 23:37
Fine, there's certainly an msync() function that does guarantee changes are written back, so I'll call that as well. – gct Jul 26 '12 at 00:24
And you seem to have stated the exact opposite of what you're claiming now here: http://stackoverflow.com/questions/5902629/mmap-msync-and-linux-process-termination – gct Jul 26 '12 at 00:33
No, both answers agree. `msync` is not needed to cause the changes to the file to be visible to other processes/views of the file. All it does is attempt to ensure safety across system crash. `msync` is completely analogous to `fsync`. – R.. GitHub STOP HELPING ICE Jul 26 '12 at 02:15
Again, the whole source of your confusion is thinking that "writing back the changes to disk" matters. The **only** effect of writing data back to disk is that, if you flip the power switch after it's written rather than before, you'll still see the data when you restart the system (if you're lucky). It has **no effect whatsoever** as long as the system is continuously running. *Logically* (as opposed to *physically*), changes to memory mapped files are written *immediately*, always. – R.. GitHub STOP HELPING ICE Jul 26 '12 at 02:21

Sorry, I think maybe I didn't explain well. I want to memory map a file, but then create a second anonymous memory mapped region to proxy access to the memory mapped file for type conversion. I can handle the read part of that by protecting the anonymous region and catching SIGSEGV on it, but it's going the other way, actually writing to the second mapped region, that I was curious if I could handle somehow. – gct Jul 26 '12 at 02:38
When do you want changes to the anonymous second mapping to be stored back to the file mapping? – R.. GitHub STOP HELPING ICE Jul 26 '12 at 03:06
I guess I would say "at a convenient time" for whatever that means. As long as the changes are stored before the process ends or when the file is closed I'd be OK with whatever solution. – gct Jul 26 '12 at 13:58
Then there's no need to catch a particular event to write it back. Just do it on program termination or file close. – R.. GitHub STOP HELPING ICE Jul 27 '12 at 00:12
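A rough sketch of that suggestion, converting once on load and once at close instead of trying to hook page write-back. Everything here is an assumption made for illustration: the on-disk format (big-endian 32-bit integers), the `struct conv_view` handle, and the names `conv_load`, `conv_store` and `conv_roundtrip_demo` are hypothetical, and a plain `malloc` buffer stands in for the anonymous proxy region.

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>   /* htonl, ntohl */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical handle: host-order working copy of a file that stores
 * big-endian 32-bit integers. */
struct conv_view {
    int fd;
    size_t count;
    uint32_t *data;
};

/* Load the whole file and convert it once, up front. */
int conv_load(struct conv_view *v, int fd)
{
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size % (off_t)sizeof(uint32_t) != 0)
        return -1;
    size_t bytes = (size_t)st.st_size;
    uint32_t *data = malloc(bytes ? bytes : 1);
    if (data == NULL)
        return -1;
    if (pread(fd, data, bytes, 0) != (ssize_t)bytes) {
        free(data);
        return -1;
    }
    v->fd = fd;
    v->count = bytes / sizeof(uint32_t);
    v->data = data;
    for (size_t i = 0; i < v->count; i++)
        data[i] = ntohl(data[i]);            /* file order -> host order */
    return 0;
}

/* Convert back and store; call once at file close or before exit. */
int conv_store(struct conv_view *v)
{
    assert(v->data != NULL);                 /* view must have been loaded */
    size_t bytes = v->count * sizeof(uint32_t);
    for (size_t i = 0; i < v->count; i++)
        v->data[i] = htonl(v->data[i]);      /* host order -> file order */
    int ok = pwrite(v->fd, v->data, bytes, 0) == (ssize_t)bytes;
    free(v->data);
    v->data = NULL;
    return ok ? 0 : -1;
}

/* Round trip on a scratch file; returns 1 on success, -1 on setup failure. */
int conv_roundtrip_demo(void)
{
    char path[] = "/tmp/conv_demo_XXXXXX";   /* hypothetical scratch file */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);

    const unsigned char on_disk[4] = {0, 0, 0, 1};   /* big-endian 1 */
    struct conv_view v;
    if (pwrite(fd, on_disk, 4, 0) != 4 || conv_load(&v, fd) != 0) {
        close(fd);
        return -1;
    }
    int read_ok = (v.data[0] == 1);          /* converted on load */
    v.data[0] = 0x01020304;                  /* modify the working copy */
    if (conv_store(&v) != 0) {
        close(fd);
        return -1;
    }

    unsigned char back[4];
    int write_ok = pread(fd, back, 4, 0) == 4 &&
                   memcmp(back, "\x01\x02\x03\x04", 4) == 0;
    close(fd);
    return read_ok && write_ok;
}
```

The trade-off is that the whole file is converted eagerly, which is simple but gives up the lazy, page-at-a-time behaviour the question was after.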

score 2 · Answer 2 · answered Jul 26 '12 at 12:41

Using a memory map and a SIGSEGV handler is a bit problematic. First, mprotect() is not async-signal safe, meaning mprotect() in a signal handler is not guaranteed to work. Second, synchronization of the necessary structures between the signal handler and more than one thread is quite complex (although possible using GCC __sync and/or __atomic built-ins) as you cannot use the standard locking primitives in signal handlers -- fortunately you can simply return from the signal handler; the kernel does not skip the offending instruction, so the same signal gets raised immediately afterwards.
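To make the mechanism concrete, here is a single-threaded sketch, assuming Linux. The names `on_segv`, `lazy_fill_demo` and the globals are hypothetical, and `memset` stands in for the real type conversion. It calls `mprotect()` inside the handler, which POSIX does not list as async-signal-safe, so treat it as an illustration of the re-executed instruction rather than a portable pattern.

```c
#define _DEFAULT_SOURCE 1
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static unsigned char *g_region;          /* hypothetical proxy region */
static size_t g_region_len;
static size_t g_page;
static volatile sig_atomic_t g_faults;

static void on_segv(int sig, siginfo_t *si, void *ctx)
{
    (void)sig;
    (void)ctx;
    uintptr_t addr = (uintptr_t)si->si_addr;
    uintptr_t base = (uintptr_t)g_region;
    if (addr < base || addr >= base + g_region_len) {
        signal(SIGSEGV, SIG_DFL);        /* not ours: let the retry crash */
        return;
    }
    unsigned char *page = (unsigned char *)(addr & ~(uintptr_t)(g_page - 1));
    /* Not async-signal-safe per POSIX; works on Linux in practice. */
    if (mprotect(page, g_page, PROT_READ | PROT_WRITE) != 0)
        _exit(1);
    memset(page, 0x2A, g_page);          /* stand-in for the conversion */
    g_faults++;
    /* Returning re-executes the faulting instruction, which now succeeds. */
}

/* Returns the number of faults handled (2 expected), or -1 on failure. */
int lazy_fill_demo(void)
{
    g_faults = 0;
    g_page = (size_t)sysconf(_SC_PAGESIZE);
    assert((g_page & (g_page - 1)) == 0);    /* mask above needs a power of two */
    g_region_len = 4 * g_page;
    g_region = mmap(NULL, g_region_len, PROT_NONE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (g_region == MAP_FAILED)
        return -1;

    struct sigaction sa, old;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) != 0) {
        munmap(g_region, g_region_len);
        return -1;
    }

    volatile unsigned char *p = g_region;
    int first = p[0];                    /* faults; handler fills page 0 */
    int again = p[1];                    /* same page, no fault */
    int other = p[2 * g_page];           /* faults again for page 2 */

    sigaction(SIGSEGV, &old, NULL);
    munmap(g_region, g_region_len);
    if (first != 0x2A || again != 0x2A || other != 0x2A)
        return -1;
    return (int)g_faults;
}
```

The handler touches only global state and returns without longjmp, relying on the kernel restarting the faulting load once the page is readable.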

I did write a small program to test an anonymous private unreserved memory map, using read() and write() to update the map. The problem is that other threads may access the map while the signal handler is updating it.

I think it might work if you use a temporary file for the currently active region, with an extra page before and after to hold partial records when the records cross page boundaries.

The actual data file would be represented by a private anonymous unreserved inaccessible map (PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE | MAP_NORESERVE). A SIGSEGV signal handler catches accesses to that map. A page-aligned region of that map is unmapped and mapped from the temporary file (MAP_SHARED | MAP_FIXED | MAP_NORESERVE). The trick is that the temporary file can additionally be mapped (MAP_SHARED | MAP_NORESERVE) to another memory region, and the signal handler can simply unmap the temporary file within the map to stop other threads from accessing the region during conversion; the data is still available to your library functions in the other memory region (to be read from and written to the actual data file using read() and write()). MAP_SHARED means the exact same pages (from the page cache) are used, and MAP_NORESERVE means the kernel does not reserve swap or RAM for them.
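The double-mapping trick can be sketched without any signal handling. This is only an illustration, assuming Linux; `alias_mapping_demo`, the /tmp path and the variable names are hypothetical. One deviation from the description: instead of a bare `munmap()`, the window is hidden by mapping `PROT_NONE` anonymous memory over it with `MAP_FIXED`, which keeps the address range reserved.

```c
#define _DEFAULT_SOURCE 1
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if both views share pages and the private alias survives
 * hiding the window, 0 if not, -1 on setup failure. */
int alias_mapping_demo(void)
{
    char path[] = "/tmp/alias_demo_XXXXXX";  /* hypothetical temporary file */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);

    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    if (ftruncate(fd, (off_t)page) != 0) {
        close(fd);
        return -1;
    }

    /* Placeholder for the data file: address space only, no access. */
    unsigned char *pub = mmap(NULL, 4 * page, PROT_NONE,
                              MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    /* The library's own view of the temporary file. */
    unsigned char *priv = mmap(NULL, page, PROT_READ | PROT_WRITE,
                               MAP_SHARED | MAP_NORESERVE, fd, 0);
    if (pub == MAP_FAILED || priv == MAP_FAILED) {
        close(fd);
        return -1;
    }

    /* Expose the temporary file inside the placeholder, at page 1. */
    unsigned char *win = mmap(pub + page, page, PROT_READ | PROT_WRITE,
                              MAP_SHARED | MAP_FIXED | MAP_NORESERVE, fd, 0);
    if (win == MAP_FAILED) {
        close(fd);
        return -1;
    }
    assert(win == pub + page);               /* MAP_FIXED honours the address */

    volatile unsigned char *app = win;       /* what application code touches */
    volatile unsigned char *lib = priv;      /* what the library touches */

    app[0] = 7;                              /* store through the window */
    int shared = (lib[0] == 7);              /* same page-cache page */

    /* Hide the window again; the alias stays usable for conversion. */
    void *hidden = mmap(pub + page, page, PROT_NONE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED | MAP_NORESERVE,
                        -1, 0);
    lib[0] = 9;
    int still = (hidden != MAP_FAILED && lib[0] == 9);

    munmap(priv, page);
    munmap(pub, 4 * page);
    close(fd);
    return shared && still;
}
```

After the window is hidden, any access through it would fault again, while the library can keep converting through its own alias.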

This approach should work well with respect to threads and locking, but it still suffers from mmap(), munmap(), and mremap() not being async-signal safe. However, if you do have a global variable accessed only atomically causing the signal handler to immediately return if application/library code is modifying the structures and/or maps, this should be reliable.

Thanks for the comments. I think I'll be restricting myself to using the memory mapped region from a single thread (the main one). I'd thought about the synchronization problem and decided it wasn't worth the hassle. mprotect not being signal safe is worrisome though. Though as mentioned here: http://stackoverflow.com/questions/2663456/write-a-signal-handler-to-catch-sigsegv it should be safe as a practical matter on linux, which is where I'll be living. — gct, Jul 26 '12 at 14:09
I did not realize it yesterday, but if you use a separate thread and either `sigwaitinfo()` or `signalfd()` to receive SIGSEGV and SIGBUS signals, with both blocked using `sigprocmask()` for all other threads, there should be no problems whatsoever. Remember, both signals will simply be reraised if there is a race condition, as long as you make sure to use `mprotect()` early and late enough. The library state could be limited to that thread, and you could use `pthread_mutex_t` for locking. Do you need real working example code? (I'd only need to clean up my test code.) — Nominal Animal, Jul 27 '12 at 11:04
