Using a memory map and a SIGSEGV handler is a bit problematic. First, mprotect() is not async-signal safe, meaning that calling mprotect() in a signal handler is not guaranteed to work. Second, synchronizing the necessary structures between the signal handler and more than one thread is quite complex (although possible using the GCC __sync and/or __atomic built-ins), because you cannot use the standard locking primitives in signal handlers. Fortunately, you can simply return from the signal handler: the kernel does not skip the offending instruction, so if the access still fails, the same signal gets raised again immediately afterwards.
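To illustrate the "just return and the instruction is retried" behaviour, here is a minimal sketch for Linux. The names install_handler_and_map(), on_segv(), region, page_size and fault_count are made up for this example. It also calls mprotect() from the handler, which, as noted above, POSIX does not guarantee to be safe; the sketch simply assumes it works because on Linux it is a thin system call wrapper.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static volatile sig_atomic_t fault_count = 0; /* faults handled so far */
static char  *region = NULL;                  /* start of the protected map */
static size_t region_len = 0;                 /* its length in bytes */
static size_t page_size = 0;                  /* cached before the handler is installed */

static void on_segv(int signum, siginfo_t *info, void *context)
{
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)region;
    (void)signum;
    (void)context;

    /* A fault outside our map is a genuine crash; give up. */
    if (addr < base || addr >= base + region_len)
        _exit(127);

    fault_count++;

    /* ASSUMPTION: mprotect() is not on the async-signal-safe list;
     * this sketch relies on it working in practice on Linux. */
    addr &= ~((uintptr_t)page_size - 1);
    if (mprotect((void *)addr, page_size, PROT_READ | PROT_WRITE) != 0)
        _exit(126);

    /* Returning restarts the faulting instruction, which now succeeds. */
}

/* Create an inaccessible anonymous map of `pages` pages and install the handler. */
int install_handler_and_map(size_t pages)
{
    struct sigaction act;

    assert(pages > 0);
    page_size = (size_t)sysconf(_SC_PAGESIZE);
    region_len = pages * page_size;
    region = mmap(NULL, region_len, PROT_NONE,
                  MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (region == MAP_FAILED)
        return -1;

    memset(&act, 0, sizeof act);
    sigemptyset(&act.sa_mask);
    act.sa_sigaction = on_segv;
    act.sa_flags = SA_SIGINFO;
    return sigaction(SIGSEGV, &act, NULL);
}
```

After install_handler_and_map(4), the first access to each page of region takes one trip through the handler and then completes normally; later accesses to the same page do not fault.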
I did write a small program to test an anonymous private unreserved memory map, using read() and write() to update the map. The problem is that other threads may access the map while the signal handler is updating it.
I think it might work if you use a temporary file for the currently active region, with an extra page before and after to hold partial records when the records cross page boundaries.
The actual data file would be represented by a private anonymous unreserved inaccessible map (PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE | MAP_NORESERVE). A SIGSEGV signal handler catches accesses to that map. A page-aligned region of that map is unmapped and mapped from the temporary file (MAP_SHARED | MAP_FIXED | MAP_NORESERVE). The trick is that the temporary file can additionally be mapped (MAP_SHARED | MAP_NORESERVE) to another memory region, and the signal handler can simply unmap the temporary file within the map to stop other threads from accessing the region during conversion; the data is still available to your library functions in the other memory region (to be read from and written to the actual data file using read() and write()). MAP_SHARED means the exact same pages (from the page cache) are used, and MAP_NORESERVE means the kernel does not reserve swap or RAM for them.
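The double mapping can be sketched roughly as follows (Linux, no signal handler involved, just the mapping calls). struct window and the window_*() functions are hypothetical names, and the temporary file location under /tmp is an assumption. One deliberate deviation: instead of a plain munmap(), the sketch hides the window by mapping fresh PROT_NONE anonymous pages over it with MAP_FIXED, so the address range stays reserved.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

struct window {
    char  *reserve;     /* PROT_NONE placeholder standing in for the data file */
    size_t reserve_len; /* placeholder length in bytes */
    char  *alias;       /* library-side view of the temporary file */
    size_t len;         /* window (temporary file) length in bytes */
    int    fd;          /* temporary file descriptor */
};

/* Create the placeholder, the temporary file, and the library-side alias. */
int window_init(struct window *w, size_t reserve_pages, size_t window_pages)
{
    char   path[] = "/tmp/window-XXXXXX"; /* hypothetical location */
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    assert(window_pages > 0 && window_pages <= reserve_pages);
    w->reserve_len = reserve_pages * page;
    w->len = window_pages * page;

    w->fd = mkstemp(path);
    if (w->fd == -1)
        return -1;
    unlink(path); /* the file lives on until the descriptor is closed */
    if (ftruncate(w->fd, (off_t)w->len) != 0)
        return -1;

    w->reserve = mmap(NULL, w->reserve_len, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (w->reserve == MAP_FAILED)
        return -1;

    w->alias = mmap(NULL, w->len, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_NORESERVE, w->fd, 0);
    return (w->alias == MAP_FAILED) ? -1 : 0;
}

/* Expose the temporary file at a page-aligned offset inside the placeholder. */
int window_show(struct window *w, size_t offset)
{
    void *p = mmap(w->reserve + offset, w->len, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_FIXED | MAP_NORESERVE, w->fd, 0);
    return (p == MAP_FAILED) ? -1 : 0;
}

/* Hide the window again; the data stays reachable through w->alias. */
int window_hide(struct window *w, size_t offset)
{
    void *p = mmap(w->reserve + offset, w->len, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED | MAP_NORESERVE,
                   -1, 0);
    return (p == MAP_FAILED) ? -1 : 0;
}
```

While the window is shown, a store through w->reserve + offset is visible through w->alias and vice versa, because both mappings share the same page cache pages. After window_hide(), other threads fault on the placeholder again, while the library can still read w->alias and write it out to the actual data file.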
This approach should work well with respect to threads and locking, but it still suffers from mmap(), munmap(), and mremap() not being async-signal safe. However, if you have a global variable, accessed only atomically, that causes the signal handler to return immediately whenever application/library code is modifying the structures and/or maps, this should be reliable.
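The atomically accessed global variable could look something like the sketch below, using the GCC __atomic built-ins. All names (maps_busy, maps_try_lock(), maps_unlock(), handle_fault(), repair_mapping(), install_guarded_handler()) are hypothetical, and repair_mapping() is only a stub standing in for the real remapping work.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

static int maps_busy = 0;    /* 1 while somebody is modifying the maps */
static int repairs_done = 0; /* how many times the stub repair ran */

/* Try to claim the maps; returns 1 on success, 0 if already claimed. */
int maps_try_lock(void)
{
    return __atomic_exchange_n(&maps_busy, 1, __ATOMIC_ACQUIRE) == 0;
}

/* Release the maps again. */
void maps_unlock(void)
{
    __atomic_store_n(&maps_busy, 0, __ATOMIC_RELEASE);
}

/* Hypothetical stub: the real code would remap the faulting region here. */
static void repair_mapping(void *addr)
{
    (void)addr;
    __atomic_add_fetch(&repairs_done, 1, __ATOMIC_RELAXED);
}

/* Core of the handler: returns 1 if the fault was handled, or 0 if the
 * maps were busy and the handler should simply return. */
int handle_fault(void *addr)
{
    if (!maps_try_lock())
        return 0; /* busy: return; the faulting access is retried */
    repair_mapping(addr);
    maps_unlock();
    return 1;
}

static void on_segv(int signum, siginfo_t *info, void *context)
{
    (void)signum;
    (void)context;
    (void)handle_fault(info->si_addr);
}

/* Install the guarded SIGSEGV handler. */
int install_guarded_handler(void)
{
    struct sigaction act;

    memset(&act, 0, sizeof act);
    sigemptyset(&act.sa_mask);
    act.sa_sigaction = on_segv;
    act.sa_flags = SA_SIGINFO;
    return sigaction(SIGSEGV, &act, NULL);
}
```

Library code would bracket its own changes to the structures and maps with maps_try_lock() and maps_unlock(). A thread that faults in the meantime just keeps re-entering the handler until the flag is clear, which amounts to a busy wait. One caveat with this design: code holding the flag must never touch the protected map itself, because its own fault would then return from the handler forever.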