
This question is a spin-off of the one posted here: Measuring bandwidth on a ccNUMA system

I've written a micro-benchmark for the memory bandwidth on a ccNUMA system with 2x Intel(R) Xeon(R) Platinum 8168:

  1. 24 cores @ 2.70 GHz,
  2. L1 cache 32 kB, L2 cache 1 MB and L3 cache 33 MB.

As a reference, I'm using the Intel Advisor's roof-line plot, which depicts the bandwidths of each CPU data-path available. According to this, the bandwidth is 230 GB/s.

Strong scaling of bandwidth: [figure]

Question: If you look at the strong scaling diagram, you can see that the peak effective bandwidth is actually achieved at 33 CPUs, following which adding CPUs only reduces it. Why is this happening?

  • Dual-processor hardware (designed for SMP) with a pair of multi-core CPUs can only reflect the micro-architectural (be it QPI or other) in-silicon & MoBo facts. That includes the CPU-to-RAM physical I/O channels. Your micro-benchmark's assumption of "uniformity" of all paths from any CPU core to any physical RAM area (which is not guaranteed, unless the SMP hardware was specifically designed so as not to be a NUMA system) fails. So does the assumed memory-I/O bandwidth scaling. It is not uniform & fails even inside the first core. Check the facts on the physical structure of the CPU-core-to-memory-I/O system. Reality matters – user3666197 May 13 '22 at 20:04
  • Differences between an "own"-DRAM access and a cross-QPI DRAM access were discussed here - https://stackoverflow.com/questions/4087280/approximate-cost-to-access-various-caches-and-main-memory/33065382#33065382 and elsewhere in chip / MoBo / GPU manufacturers' detailed design documentation. – user3666197 May 13 '22 at 20:08
  • @user3666197 So your claim is that even in UMA systems, there can be non-uniform memory access latencies? – Nitin Malapally May 16 '22 at 08:43
  • @user3666197 Regarding your second comment, I've taken special care to ensure proper memory-page mapping i.e. so-called "first touch". – Nitin Malapally May 16 '22 at 08:45

1 Answer


Overview

This answer provides probable explanations. Put shortly, no parallel workload scales infinitely. When many cores compete for the same shared resource (e.g. DRAM), using too many cores is often detrimental: there is a point where enough cores saturate a given shared resource, and adding more only increases the overhead.

More specifically, in your case, the L3 cache and the IMCs are likely the problem. Enabling Sub-NUMA Clustering and using non-temporal prefetches should improve the performance and scalability of your benchmark a bit. Still, there are other architectural hardware limitations that can prevent the benchmark from scaling well. The next section describes how Intel Skylake SP processors deal with memory accesses and how to find the bottlenecks.


Under the hood

In your case, the layout of the Intel Xeon Skylake SP processors is as follows:

[Figures: processor configuration and core configuration. Source: Intel]

There are two sockets connected with a UPI interconnect and each processor is connected to its own set of DRAM. There are 2 Integrated Memory Controllers (IMCs) per processor and each is connected to 3 DDR4-2666 channels. This means the theoretical bandwidth is 2*2*3*2666e6*8 = 256 GB/s = 238 GiB/s.
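
For reference, this arithmetic can be spelled out as a tiny, self-contained sketch (only the figures quoted above are used; nothing here is measured):

```c++
// Peak theoretical DRAM bandwidth for this machine:
// 2 sockets x 2 IMCs x 3 channels x 2666e6 transfers/s x 8 bytes per transfer.
#include <cstdio>

int main() {
    const double bytes_per_s = 2.0 * 2.0 * 3.0 * 2666e6 * 8.0;
    std::printf("%.0f GB/s (%.0f GiB/s)\n",
                bytes_per_s / 1e9,                          // ~256 GB/s
                bytes_per_s / (1024.0 * 1024.0 * 1024.0));  // ~238 GiB/s
    return 0;
}
```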

Assuming your benchmark is well designed and each processor accesses only its own NUMA node, I expect a very low UPI throughput and a very low number of remote NUMA pages. You can check this with hardware counters: both Linux perf and VTune let you do that relatively easily.
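
As a side note, here is a minimal sketch of the first-touch initialization mentioned in the comments, assuming an OpenMP benchmark with threads pinned to cores (e.g. OMP_PROC_BIND=close and OMP_PLACES=cores); this is not the OP's actual code, just the general pattern:

```c++
// Allocate without touching the memory, then let each pinned OpenMP thread perform
// the first write, so pages get mapped on the NUMA node of the core that will later
// stream through them. Compile with -fopenmp.
#include <cstddef>
#include <cstdio>
#include <cstdlib>

int main() {
    const std::size_t n = std::size_t{1} << 27;   // 1 GiB of doubles
    double* a = static_cast<double*>(std::malloc(n * sizeof(double)));
    if (!a) return 1;

    #pragma omp parallel for schedule(static)      // same schedule as the measured kernel
    for (std::size_t i = 0; i < n; ++i)
        a[i] = 0.0;                                // first touch: page mapped near this thread

    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i];                               // reads should now hit node-local DRAM

    std::printf("%f\n", sum);
    std::free(a);
    return 0;
}
```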

The L3 cache is split into slices. All physical addresses are distributed across the cache slices using a hash function (see here for more information). This method enables the processor to balance the throughput between all the L3 slices. It also enables the processor to balance the throughput between the two IMCs, so that in the end the processor looks like an SMP architecture rather than a NUMA one. The same approach was also used in Sandy Bridge and Xeon Phi processors (mainly to mitigate NUMA effects).

Hashing does not guarantee perfect balancing though (no hash function is perfect, especially the ones that are fast to compute), but it is often quite good in practice, especially for contiguous accesses. Poor balancing decreases the memory throughput due to partial stalls. This is one reason you cannot reach the theoretical bandwidth.
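
The real slice hash is undocumented, so purely as an illustration (slice_of below is a made-up toy hash, not Intel's), here is how one can picture physical cache-line addresses being spread over a fixed number of slices and how evenly a contiguous stream lands on them:

```c++
// Toy illustration only: distribute cache-line addresses over 4 hypothetical L3 slices
// with a simple XOR-fold and count how many lines each slice receives.
#include <array>
#include <cstdint>
#include <cstdio>

constexpr int kSlices = 4;                        // hypothetical slice count

int slice_of(std::uint64_t addr) {
    std::uint64_t line = addr >> 6;               // 64-byte cache-line granularity
    line ^= line >> 7;                            // fold higher address bits into the low ones
    line ^= line >> 13;
    return static_cast<int>(line % kSlices);
}

int main() {
    std::array<std::uint64_t, kSlices> hits{};
    for (std::uint64_t addr = 0; addr < (1u << 20); addr += 64)  // contiguous 1 MiB stream
        ++hits[slice_of(addr)];
    for (int s = 0; s < kSlices; ++s)
        std::printf("slice %d: %llu lines\n", s,
                    static_cast<unsigned long long>(hits[s]));
    return 0;
}
```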

With a good hash function, the balancing should be independent of the number of cores used. If the hash function is not good enough, one IMC can be more saturated than the other, and this imbalance can oscillate over time. The bad news is that the hash function is undocumented and checking this behaviour is complex: AFAIK you can get hardware counters for each IMC's throughput, but their granularity is quite coarse. On my Skylake machine the hardware counters are named uncore_imc/data_reads/ and uncore_imc/data_writes/, but on your platform you should have 4 such counters (one per IMC).

Fortunately, Intel provides a feature called Sub-NUMA Clustering (SNC) on Xeon SP processors like yours. The idea is to split the processor into two NUMA nodes that each have their own dedicated IMC. This solves the balancing issue caused by the hash function and so results in faster memory operations, as long as your application is NUMA-friendly. Otherwise, it can actually be significantly slower due to NUMA effects: in the worst case, all the pages of an application can be mapped to the same NUMA node, leaving only half the bandwidth usable. Since your benchmark is supposed to be NUMA-friendly, SNC should make it more efficient.

[Figure: Sub-NUMA Clustering. Source: Intel]
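
If you are unsure whether SNC is actually enabled, one simple sanity check from a program (a sketch assuming libnuma is installed; link with -lnuma) is to ask how many NUMA nodes the OS sees: with SNC on, this dual-socket machine should report 4 nodes instead of 2.

```c++
// Report the number of NUMA nodes visible to the OS (4 expected with SNC enabled
// on a dual-socket Xeon SP, 2 with SNC disabled). Requires libnuma (-lnuma).
#include <numa.h>
#include <cstdio>

int main() {
    if (numa_available() < 0) {
        std::puts("libnuma: NUMA not available on this system");
        return 1;
    }
    std::printf("configured NUMA nodes: %d\n", numa_num_configured_nodes());
    return 0;
}
```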

Furthermore, having more cores accessing the L3 in parallel can cause more early evictions of prefetched cache lines, which then need to be fetched again later when the core actually needs them (with an additional DRAM latency to pay). This effect is not as unusual as it seems. Indeed, due to the high latency of DDR4 DRAM, hardware prefetching units have to prefetch data well in advance to hide that latency, and they also need to issue many requests concurrently. This is generally not a problem with sequential accesses, but more cores cause the accesses to look more random from the caches' and IMCs' point of view. The thing is, DRAM is designed so that contiguous accesses are faster than random ones (multiple contiguous cache lines should be loaded consecutively to fully saturate the bandwidth). You can analyse the value of the LLC-load-misses hardware counter to check whether more data is re-fetched with more threads (I see such an effect on my Skylake-based PC with only 6 cores, but it is not strong enough to cause any visible impact on the final throughput).

To mitigate this problem, you can use software non-temporal prefetches (prefetchnta) to request the processor to load data directly into the line fill buffer instead of the L3 cache, resulting in less pollution (here is a related answer). This may be slower with few cores due to the lower concurrency, but it should be a bit faster with a lot of cores. Note that this does not solve the problem of fetched addresses looking more random from the IMCs' point of view, and there is not much that can be done about that.
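
For what it's worth, here is a minimal sketch of such a software NT prefetch in a streaming reduction (this is not the OP's benchmark; the prefetch distance is a placeholder that has to be tuned per machine):

```c++
// prefetchnta via the _mm_prefetch intrinsic in a streaming read kernel.
// PF_DIST (in cache lines) is a hypothetical tuning knob, not a recommended value.
#include <immintrin.h>
#include <cstddef>

double sum_nt_prefetch(const double* a, std::size_t n) {
    constexpr std::size_t kLine = 8;          // 8 doubles per 64-byte cache line
    constexpr std::size_t PF_DIST = 64;       // prefetch 64 lines (4 KiB) ahead; tune this
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i % kLine == 0 && i + PF_DIST * kLine < n)
            _mm_prefetch(reinterpret_cast<const char*>(a + i + PF_DIST * kLine),
                         _MM_HINT_NTA);       // non-temporal hint: minimizes cache pollution
        s += a[i];
    }
    return s;
}
```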

The low-level architecture of DRAM and caches is very complex in practice. More information about memory can be found in the following links:

  • Thank you for the very comprehensive answer. The part about the sliced L3 caches was indeed informative and interesting, but I think your first and last paragraphs answer my question best. If it's OK with you, I'll conclude that it's L3-cache contention caused by hardware prefetching. – Nitin Malapally May 17 '22 at 09:17
  • Ok. Good to know. I am ok with this conclusion. I encourage you to check hardware counters to confirm this hypothesis ;) . Modern processors are very complex and it is quite frequent to find surprising performance behaviours directly due to tiny details of the underlying hardware architecture (sometimes not even documented). – Jérôme Richard May 17 '22 at 11:20
  • 1
    SSE4.1 NT loads (`movntdqa`) are only special on WC memory regions (uncacheable write-combining). On normal WB memory regions, they run like `movdqa` with an extra ALU uop. On WB memory, you need `prefetchnta` (with a carefully-tuned prefetch distance); on Intel CPUs without an inclusive L3, they can fully bypass it and L2, only filling L1d. – Peter Cordes May 23 '22 at 14:45
  • @PeterCordes Sorry, I meant `prefetchnta` for "non-temporal loads" (as in the linked answer). Good to know for the WC memory (though I have almost never used it so far). Are you sure about the L1d filling with WB memory? It is clearly not documented in the Intel doc, but it looked like the LFBs were used (AFAIK they are used for stores, which [seems to be confirmed by Mr Bandwidth](https://community.intel.com/t5/Software-Tuning-Performance/WB-vs-WC-memory-type/m-p/1016410), though it is still unclear). – Jérôme Richard May 23 '22 at 23:32
  • 1
    @JérômeRichard: Pretty sure I remember Intel's optimization manual describing what happens for NT prefetch from WB memory. I've never played with WC for DRAM; the use-case for SSE4.1 `movntdqa` is copying back from video RAM, as in Intel's whitepaper about it. NT prefetch and load can't weaken the memory ordering rules (unlike NT store) in case that's relevant. And there aren't enough LFBs to usefully prefetch far enough ahead (latency x bandwidth product is pretty far), especially in an algorithm that's also doing stores; WC NT loads are very sensitive to early eviction. – Peter Cordes May 23 '22 at 23:37
  • @PeterCordes I mean even if the L3 is not inclusive, the L2 should still be, so NT prefetches would cause cache pollution of the L2 too (due to the write-back write-allocate cache policy). This would be a significant waste. Putting data in the LFB would not pollute any cache. I guess experiments could determine which method is used, but I have not found any strong evidence/benchmarks about that yet. – Jérôme Richard May 23 '22 at 23:39
  • 1
    See https://web.archive.org/web/20120210023754/https://software.intel.com/en-us/articles/copying-accelerated-video-decode-frame-buffers (Intel's whitepaper I was referring to using NT load / NT store to copy back from GPU memory). Notice how they bounce through a buffer in cache to avoid early LFB evictions from NT load or NT store. Also [Do current x86 architectures support non-temporal loads (from "normal" memory)?](https://stackoverflow.com/q/40096894) and [Non-temporal loads and the hardware prefetcher, do they work together?](https://stackoverflow.com/q/32103968) – Peter Cordes May 23 '22 at 23:41
  • 1
    No Intel CPU I'm aware of has inclusive L2, always NINE. NT prefetch bypasses it on all CPUs. Skylake-client's L2 is less associative than its L1d so that would be pretty bad for some cases of aliasing. Intel's used the same general design since Nehalem, https://www.realworldtech.com/nehalem/7/ states that its L2 is not-inclusive not-exclusive. (Before Nehalem, L2 was the last-level cache, but I don't think it was inclusive then either.) Some AMD CPUs have had Exclusive L2's IIRC. – Peter Cordes May 23 '22 at 23:43
  • Re: what are LFBs for: they track incoming and outgoing cache lines from L1d (and loads snoop them). Write coalescing from the store buffer can also happen into LFBs while waiting for an RFO to finish, but only if they're back-to-back and to the same cache line, else it could lose track of memory ordering; BeeOnRope managed to do some useful experiments that hint at that. NT stores commit directly into LFBs instead of L1d cache. See also [Where is the Write-Combining Buffer located? x86](https://stackoverflow.com/a/49960213) – Peter Cordes May 23 '22 at 23:55
  • Also [How do the store buffer and Line Fill Buffer interact with each other?](https://stackoverflow.com/q/61129773) – Peter Cordes May 23 '22 at 23:56