
The General Situation

An application that is extremely intensive in bandwidth, CPU usage, and GPU usage needs to transfer about 10-15 GB per second from one GPU to another. It uses the DX11 API to access the GPU, so uploads to the GPU can only happen via buffers that require mapping for each individual upload. The upload happens in chunks of 25 MB at a time, and 16 threads are writing to mapped buffers concurrently. There's not much that can be done about any of this. The actual concurrency level of the writes would be lower, too, if it weren't for the following bug.

It's a beefy workstation with 3 Pascal GPUs, a high-end Haswell processor, and quad-channel RAM. Not much can be improved on the hardware. It's running a desktop edition of Windows 10.

The Actual Problem

Once I pass ~50% CPU load, something in MmPageFault() (inside the Windows kernel, called when accessing memory which has been mapped into your address space, but was not committed by the OS yet) breaks horribly, and the remaining 50% CPU load is being wasted on a spin-lock inside MmPageFault(). The CPU becomes 100% utilized, and the application performance completely degrades.

I must assume that this is due to the immense amount of memory which needs to be allocated to the process each second, and which is also completely unmapped from the process every time a DX11 buffer is unmapped. Correspondingly, there are actually thousands of calls to MmPageFault() per second, happening sequentially as memcpy() writes through the buffer, one for each single uncommitted page encountered.
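To put a rough number on that, here is a minimal sketch, assuming standard 4 KB pages (an assumption; large pages would change the math) and the 25 MB chunk size described above:

#include <cstddef>
#include <cstdio>

int main() {
    // Assuming 4 KB pages and 25 MB upload chunks as described above.
    constexpr std::size_t pageSize  = 4 * 1024;
    constexpr std::size_t chunkSize = 25 * 1024 * 1024;
    constexpr std::size_t faultsPerChunk = (chunkSize + pageSize - 1) / pageSize;
    // Roughly 6,400 soft faults for every single freshly mapped 25 MB chunk.
    std::printf("soft faults per mapped chunk: %zu\n", faultsPerChunk);
    return 0;
}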

Once the CPU load goes beyond 50%, the optimistic spin-lock in the Windows kernel protecting the page management degrades completely in terms of performance.

Considerations

The buffer is allocated by the DX11 driver. Nothing can be tweaked about the allocation strategy. Use of a different memory API and especially re-use is not possible.

Calls to the DX11 API (mapping/unmapping the buffers) all happen from a single thread. The actual copy operations potentially happen multi-threaded, across more threads than there are virtual processors in the system.

Reducing the memory bandwidth requirements is not possible; it's a real-time application. In fact, the hard limit is currently the PCIe 3.0 x16 bandwidth of the primary GPU (roughly 16 GB/s theoretical per direction). If I could, I would already need to push further.

Avoiding multi-threaded copies is not possible, as there are independent producer-consumer queues which can't be merged trivially.

The spin-lock performance degradation appears to be so rare (because the use case is pushing it that far) that on Google, you won't find a single result for the name of the spin-lock function.

Upgrading to an API which gives more control over the mappings (Vulkan) is in progress, but it's not suitable as a short-term fix. Switching to a better OS kernel is currently not an option for the same reason.

Reducing the CPU load doesn't work either; there is too much work which needs to be done other than the (usually trivial and inexpensive) buffer copy.

The Question

What can be done?

I need to reduce the number of individual page faults significantly. I know the address and size of the buffer which has been mapped into my process, and I also know that the memory has not been committed yet.

How can I ensure that the memory is committed with the least amount of transactions possible?

Exotic flags for DX11 which would prevent de-allocation of the buffers after unmapping, Windows APIs to force commit in a single transaction, pretty much anything is welcome.

The current state

// In the processing threads
{
    // Map, copy, unmap for every single 25 MB chunk.
    DX11DeferredContext->Map(..., &buffer);
    std::memcpy(buffer, source, size);
    DX11DeferredContext->Unmap(...);
}
  • It sounds like you are at about 400 MB for all 16 threads together. Pretty low. Can you verify that you do not exceed this in your application? What is the peak memory consumption there? I wonder if you have a memory leak. – Serge Jul 21 '17 at 16:26
  • The peak consumption is around 7-8GB, but that's normal, considering that in total the entire processing pipeline needs >1s of buffering to compensate for all sorts of bottlenecks. Yes, it's "only" 400MB, 25 times per second. And it works just fine, until the base CPU load goes above 50% and the performance of the spin lock suddenly spikes from virtually 0 to ~40-50% of total CPU utilization, also affecting other processes on the system at the same time. – Ext3h Jul 21 '17 at 16:33
  • 1. What is your physical memory? Can you kill all other active processes? 2. Guess #2: since you see the 50% threshold, you might get into some issues with hyperthreading. How many physical cores do you have? 8? Can you disable hyperthreading? Try to run as many threads as there are physical CPUs in your case, on a clean machine. – Serge Jul 21 '17 at 17:39
  • @Serge 16GB, 2.5 to 4GB baseline depending on whether Visual Studio is also running or not. It's not swapping, that's the first thing I checked. Happens with and without other processes running. 6 cores, but yes, Hyperthreading is active, and I did not think about trying without yet. Will do so on Monday, but it might cause the CPU performance to become a bottleneck. – Ext3h Jul 21 '17 at 23:19
  • This seems to be a convoy state on the mutex, a condition which can occur when a mutex is taken more than once per timeslice quantum. Given you are looking for a quick fix, could you back off processing if you detect it? If you could create a 20ms or so gap in data being delivered, the convoy could sort itself out and recover. https://en.wikipedia.org/wiki/Lock_convoy – mksteve Jul 23 '17 at 07:15
  • @mksteve Thank you, that article seems to describe the situation, at least partially. There is no yield / context switch possible while inside that part of the kernel, so it's actually the spin lock itself eating CPU cycles rather than context switch overhead. The classic lock convoy was resolved in msvc2015 (Windows 7 introduced a new mutex implementation which is immune; msvc2015 is the first version to use it). However, I can neither detect this in time (it's occurring during a memcpy on a contiguous memory region), nor could I afford to wait 20ms every time this happens, which is extremely often. – Ext3h Jul 23 '17 at 11:17
  • Win10 seems to have issues (see https://stackoverflow.com/questions/45024029/windows-10-poor-performance-compared-to-windows-7-page-fault-handling-is-not-sc#comment77282633_45024029) when many threads are causing page faults. This costs about a factor of two. Your workaround is still the best you can do. You should open a ticket at MS support to see if they can do something about that hot lock, which was much cheaper in earlier Windows versions. – Alois Kraus Jul 30 '17 at 17:11

1 Answer


Current workaround, simplified pseudo code:

// During startup
{
    // Raise the minimum working set so VirtualLock() on the mapped buffers can succeed.
    // (2 GB; note the 64-bit literal -- 2*1024*1024*1024 would overflow a signed int.)
    SetProcessWorkingSetSize(GetCurrentProcess(), 2ull * 1024 * 1024 * 1024, -1);
}
// In the DX11 render loop thread
{
    DX11context->Map(..., &resource);
    // Fault in / commit the entire mapped range on this single thread.
    VirtualLock(resource.pData, resource.size);
    notify();   // wake the processing threads
    wait();     // wait until they have finished copying
    DX11context->Unmap(...);
}
// In the processing threads
{
    wait();     // wait for a mapped, pre-faulted buffer
    std::memcpy(buffer, source, size);
    signal();   // tell the render loop thread the copy is done
}

VirtualLock() forces the kernel to back the specified address range with RAM immediately. The call to the complementary VirtualUnlock() function is optional; it happens implicitly (and at no extra cost) when the address range is unmapped from the process. (If called explicitly, it costs about 1/3rd of the locking cost.)
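As a standalone illustration of that call, here is a minimal sketch; mappedPtr and mappedSize are placeholders for whatever the Map() call returned plus the known buffer size, not fields of any DX11 structure:

#include <windows.h>

// Sketch only: commit a freshly mapped range in one call instead of paying
// one soft fault per touched page.
bool PrefaultMappedRange(void* mappedPtr, SIZE_T mappedSize)
{
    // Back the whole range with RAM immediately.
    if (!VirtualLock(mappedPtr, mappedSize))
        return false; // typically: minimum working set too small, see below

    // No explicit VirtualUnlock() is needed; the lock goes away when the
    // range is unmapped from the process.
    return true;
}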

In order for VirtualLock() to work at all, SetProcessWorkingSetSize() needs to be called first, as the sum of all memory regions locked by VirtualLock() cannot exceed the minimum working set size configured for the process. Setting the "minimum" working set size to something higher than the baseline memory footprint of your process has no side effects unless your system is actually under memory pressure and swapping; your process will still not consume more RAM than its actual working set.
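A sketch of that startup step with basic error checking; the 2 GB figure mirrors the workaround above, and GetProcessWorkingSetSize() is only used here to keep the existing maximum:

#include <windows.h>
#include <cstdio>

// Raise the minimum working set so that subsequent VirtualLock() calls on
// the mapped buffers can succeed. Pick a value that covers the sum of all
// simultaneously locked regions plus some headroom.
bool RaiseMinimumWorkingSet(SIZE_T minimumBytes)
{
    // Keep whatever maximum is currently configured, but never below the new minimum.
    SIZE_T curMin = 0, curMax = 0;
    GetProcessWorkingSetSize(GetCurrentProcess(), &curMin, &curMax);
    SIZE_T newMax = (curMax > minimumBytes) ? curMax : minimumBytes;

    if (!SetProcessWorkingSetSize(GetCurrentProcess(), minimumBytes, newMax))
    {
        std::printf("SetProcessWorkingSetSize failed: %lu\n", GetLastError());
        return false;
    }
    return true;
}

// e.g. during startup:
// RaiseMinimumWorkingSet(SIZE_T(2) * 1024 * 1024 * 1024);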


Just the use of VirtualLock(), albeit still in individual threads and still using deferred DX11 contexts for the Map / Unmap calls, instantly decreased the performance penalty from 40-50% to a slightly more acceptable 15%.

Discarding the deferred context, and triggering both all the soft faults and the corresponding de-allocation on unmapping exclusively on a single thread, gave the necessary performance boost. The total cost of that spin-lock is now down to <1% of the total CPU usage.


Summary?

When you expect soft faults on Windows, try what you can to keep them all on the same thread. Performing the memcpy itself in parallel is unproblematic, and in some situations even necessary to fully utilize the memory bandwidth. However, that only holds if the memory is already committed to RAM. VirtualLock() is the most efficient way to ensure that.
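A generic sketch of that pattern (names and the worker count are illustrative, not taken from the application above): all soft faults are triggered by VirtualLock() on the calling thread, then the workers copy disjoint slices concurrently.

#include <windows.h>
#include <cstring>
#include <thread>
#include <vector>

// Pre-fault on one thread, copy in parallel. 'workers' is assumed > 0.
void CopyIntoMappedBuffer(void* dst, const void* src, SIZE_T size, unsigned workers)
{
    VirtualLock(dst, size); // commit the whole destination range here

    std::vector<std::thread> threads;
    const SIZE_T slice = size / workers;
    for (unsigned i = 0; i < workers; ++i)
    {
        const SIZE_T offset = i * slice;
        const SIZE_T length = (i + 1 == workers) ? (size - offset) : slice;
        threads.emplace_back([=] {
            std::memcpy(static_cast<char*>(dst) + offset,
                        static_cast<const char*>(src) + offset, length);
        });
    }
    for (auto& t : threads)
        t.join();
}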

(Unless you are working with an API like DirectX which maps memory into your process, you are unlikely to encounter uncommitted memory frequently. If you are just working with standard C++ new or malloc, your memory is pooled and recycled inside your process anyway, so soft faults are rare.)

Just make sure to avoid any form of concurrent page faults when working with Windows.
