The General Situation
An application that is extremely intensive on bandwidth, CPU usage, and GPU usage needs to transfer about 10-15 GB per second from one GPU to another. It uses the DX11 API to access the GPU, so uploads to the GPU can only go through buffers that have to be mapped for every single upload. The upload happens in chunks of 25 MB at a time, and 16 threads write to the mapped buffers concurrently. There's not much that can be done about any of this. The actual concurrency of the writes would be lower, if it weren't for the following bug.
It's a beefy workstation with 3 Pascal GPUs, a high-end Haswell processor, and quad-channel RAM. Not much can be improved on the hardware. It's running a desktop edition of Windows 10.
The Actual Problem
Once I pass ~50% CPU load, something in MmPageFault() (inside the Windows kernel, called when accessing memory which has been mapped into your address space but not yet committed by the OS) breaks horribly, and the remaining 50% of CPU load is wasted on a spin-lock inside MmPageFault(). The CPU becomes 100% utilized, and application performance degrades completely.
I must assume that this is due to the immense amount of memory which needs to be allocated to the process each second, and which is also completely unmapped from the process every time the DX11 buffer is unmapped. Correspondingly, there are actually thousands of calls to MmPageFault() per second, happening sequentially as memcpy() writes through the buffer, one for every single uncommitted page it encounters.
Once the CPU load goes beyond 50%, the optimistic spin-lock in the Windows kernel that protects the page management degrades completely, performance-wise.
Considerations
The buffer is allocated by the DX11 driver. Nothing can be tweaked about the allocation strategy. Use of a different memory API and especially re-use is not possible.
Calls to the DX11 API (mapping/unmapping the buffers) all happen from a single thread. The actual copy operations potentially happen multi-threaded, across more threads than there are virtual processors in the system.
Reducing the memory bandwidth requirements is not possible; it's a real-time application. In fact, the hard limit is currently the PCIe 3.0 x16 bandwidth of the primary GPU. If I could, I would already need to push even further.
Avoiding multi-threaded copies is not possible, as there are independent producer-consumer queues which can't be merged trivially.
The spin-lock degradation appears to be hit so rarely (because the use case pushes it that far) that you won't find a single Google result for the name of the spin-lock function.
Upgrading to an API which gives more control over the mappings (Vulkan) is in progress, but it's not suitable as a short-term fix. Switching to a better OS kernel is currently not an option for the same reason.
Reducing the CPU load doesn't work either; there is too much work which needs to be done other than the (usually trivial and inexpensive) buffer copy.
The Question
What can be done?
I need to reduce the number of individual pagefaults significantly. I know the address and size of the buffer which has been mapped into my process, and I also know that the memory has not been committed yet.
How can I ensure that the memory is committed with the least amount of transactions possible?
Exotic flags for DX11 which would prevent de-allocation of the buffers after unmapping, Windows APIs to force commit in a single transaction, pretty much anything is welcome.
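Purely as an illustration of the kind of Windows API I'm hoping for (and not something this question establishes will actually work against a driver-owned DX11 mapping): PrefetchVirtualMemory(), available on Windows 8 / Server 2012 and later, lets a process ask the memory manager to populate an entire address range in one request rather than one fault per page. A minimal sketch, assuming buffer and size are the pointer and byte count obtained from Map(), called right before the memcpy() shown below:

// Sketch only; requires _WIN32_WINNT >= 0x0602 (Windows 8). Whether this helps
// for a driver-owned DX11 mapping is exactly what is being asked here.
#include <windows.h>

void PrefetchMappedBuffer(void* buffer, size_t size)   // hypothetical helper
{
    // Describe the whole mapped range as a single entry so the memory manager
    // can bring it in with one request instead of one fault per 4 KiB page.
    WIN32_MEMORY_RANGE_ENTRY range;
    range.VirtualAddress = buffer;
    range.NumberOfBytes  = size;

    // On failure this is merely a lost hint; the copy falls back to faulting
    // page by page as it does today.
    PrefetchVirtualMemory(GetCurrentProcess(), 1, &range, 0);
}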
The current state
// In the processing threads
{
    // Map the DX11 staging buffer; the driver hands back a freshly mapped region.
    DX11DeferredContext->Map(..., &buffer);
    // Sequential copy into the mapped region; every first touch of an
    // uncommitted page goes through MmPageFault().
    std::memcpy(buffer, source, size);
    DX11DeferredContext->Unmap(...);
}