I need to resize a very large mmap'd file without copying it, while still allowing concurrent access from reader threads.
The simple way is to use two MAP_SHARED mappings of the same file in the same process: grow the file, create a second mapping that includes the grown region, and then unmap the old mapping once all readers that could access it have finished. However, I am curious whether the scheme below could work and, if so, whether it has any advantage.
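For reference, here is a minimal sketch of what I mean by the two-mapping approach, assuming Linux/POSIX and a file descriptor opened read-write. The helper names `grow_with_second_mapping` and `retire_old_mapping` are made up for illustration, and how you detect that all readers are done with the old mapping is left out.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: grow the file behind fd to new_len bytes and return a
 * second, larger MAP_SHARED mapping. The old mapping is left untouched. */
static void *grow_with_second_mapping(int fd, size_t new_len)
{
    if (ftruncate(fd, (off_t)new_len) != 0)
        return MAP_FAILED;
    /* Both mappings are backed by the same file pages, so nothing is copied. */
    return mmap(NULL, new_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

/* Hypothetical helper: call only after every reader that might still hold a
 * pointer into the old mapping has finished. */
static int retire_old_mapping(void *old_base, size_t old_len)
{
    return munmap(old_base, old_len);
}
```

Until the old mapping is retired, readers can keep using old pointers, and writes made through either mapping are visible through the other.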
1. mmap a file with MAP_PRIVATE
2. do read-only access to this memory in multiple threads
3. either acquire a mutex for the file and write to the memory (assume this is done in a way that does not disturb readers that may be reading that memory at the same time)
4. or acquire the mutex, increase the size of the file, and use mremap to move the mapping to a new address (resizing it without copying or unnecessary file I/O)
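The resize step could look roughly like the sketch below, assuming Linux (mremap is Linux-specific) and that the caller already holds the mutex. The function name `grow_with_mremap` is hypothetical.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: grow the file, then grow the mapping. With
 * MREMAP_MAYMOVE the kernel may relocate the mapping, in which case the
 * returned address differs from old_base and old_base is no longer mapped.
 * On failure MAP_FAILED is returned and the old mapping stays in place. */
static void *grow_with_mremap(int fd, void *old_base,
                              size_t old_len, size_t new_len)
{
    if (ftruncate(fd, (off_t)new_len) != 0)
        return MAP_FAILED;
    return mremap(old_base, old_len, new_len, MREMAP_MAYMOVE);
}
```

The page tables are moved rather than the data, so no copy happens. One thing to keep in mind is that writes made through a MAP_PRIVATE mapping stay in private copy-on-write pages and are never written back to the file.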
The crazy part comes in at (4). If you move the memory, the old addresses become invalid, and readers that are still reading it may suddenly get an access violation. What if we modify the readers to trap this access violation and then restart the operation (i.e. don't re-read the bad address, but re-calculate the address from the offset and the new base address returned by mremap)? Yes, I know that's evil, but to my mind a reader can only either successfully read the data at the old address, or fail with an access violation and retry. If sufficient care is taken, that should be safe. Since resizing would not happen often, the readers would eventually succeed and not get stuck in a retry loop.
A problem could occur if that old address space is re-used while a reader still has a pointer to it. Then there will be no access violation, but the data will be incorrect and the program enters the unicorn- and candy-filled land of undefined behavior (wherein there are usually neither unicorns nor candy).
But if you controlled allocations completely and could make certain that any allocations that happen during this period do not ever re-use that old address space, then this shouldn't be a problem and the behavior shouldn't be undefined.
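One way I imagine keeping the old range from being re-used is to cover it with an inaccessible placeholder right after the move, so that stale pointers keep faulting. The sketch below assumes Linux with `MAP_FIXED_NOREPLACE` (kernel 4.17 or later) and, importantly, assumes that nothing else in the process creates a mapping between the mremap call and this call. The function name `fence_old_range` is hypothetical.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: re-occupy [old_base, old_base + old_len) with a
 * PROT_NONE mapping so that any access through a stale pointer still faults.
 * Returns 0 on success, -1 if the range could not be claimed. */
static int fence_old_range(void *old_base, size_t old_len)
{
    void *p = mmap(old_base, old_len, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE |
                   MAP_FIXED_NOREPLACE,
                   -1, 0);
    if (p == MAP_FAILED)
        return -1;               /* something already lives in that range */
    if (p != old_base) {
        /* Older kernels ignore the flag and treat the address as a hint. */
        munmap(p, old_len);
        return -1;
    }
    return 0;
}
```

This is only needed when mremap actually returned a different address, and the placeholder can be unmapped once no reader can still hold a pointer into the old range.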
Am I right? Could this work? Is there any advantage to this over using two MAP_SHARED mappings?