
I've been reading benchmarks that test the benefits of systems with multiple memory channel architectures. The general conclusion of most of these benchmarks is that the performance benefit of systems with more memory channels over systems with fewer channels is negligible.

However, nowhere have I found an explanation of why this is the case, just benchmark results indicating that this is the real-world performance attained.

The theory is that every doubling of the system's memory channels doubles the bandwidth of memory access, so in theory there should be a performance gain; however, in real-world applications the gains are negligible. Why?

My postulation is that when the NT kernel allocates physical memory it is not distributing the allocations evenly across the memory channels. If all of a process's virtual memory is mapped to a single memory channel within an MMC system, then the process will effectively only be able to attain the performance of having a single memory channel at its disposal. Is this the reason for the negligible real-world performance gains?

Naturally a process is allocated virtual memory, and it is the kernel that allocates the physical memory pages, so is this negligible performance gain the fault of the NT kernel not distributing those allocations across the available channels?

    It would be nice if you can add a link to the experimental evaluation you mentioned. – Hadi Brais Feb 13 '19 at 11:13
  • https://techguided.com/single-channel-vs-dual-channel-vs-quad-channel/ https://www.pcworld.com/article/2982965/components/quad-channel-ram-vs-dual-channel-ram-the-shocking-truth-about-their-performance.html https://en.wikipedia.org/wiki/Multi-channel_memory_architecture – Duncan Gravill Feb 13 '19 at 15:22

2 Answers


Related: Why is Skylake so much better than Broadwell-E for single-threaded memory throughput? Two memory controllers are sufficient for single-threaded memory bandwidth. Only if you have multiple threads / processes that all miss in cache a lot do you start to benefit from the extra memory controllers in a big Xeon.

(e.g. your example from the comments, of running many independent image-processing tasks on different images in parallel, might benefit, depending on the task.)

Going from two down to one DDR4 channel could hurt even a single-threaded program on a quad-core if it was bottlenecked on DRAM bandwidth a lot of the time, but one important part of tuning for performance is to optimize for data reuse so you get at least L3 cache hits.

Matrix multiplication is a classic example: instead of looping over rows/columns of the whole matrix (too big to fit in cache) N^2 times, one row × column dot product for each output element, you break the work up into "tiles" and compute partial results, so you're looping repeatedly over a tile of the matrix that stays hot in L1d or L2 cache. (And you hopefully bottleneck on FP ALU throughput, running FMA instructions, not on memory at all, because matmul takes O(N^3) multiply+add operations over N^2 elements for a square matrix.) These optimizations are called "loop tiling" or "cache blocking".
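As a rough sketch (my own illustration, not from any particular library): tiled C += A*B for row-major square matrices, with hypothetical sizes N and TILE chosen so the hot tiles fit in cache:

```c
#include <stddef.h>

#define N    1024   /* matrix dimension; illustrative value */
#define TILE 64     /* tile edge, chosen so the working tiles stay hot in cache */

/* C += A * B for N x N row-major matrices (caller zeroes C first).
 * The three outer loops walk over TILE x TILE tiles; the inner loops
 * reuse each tile of A and B TILE times while it's still hot in cache,
 * instead of streaming the whole matrices from DRAM for every dot product. */
void matmul_tiled(const float *A, const float *B, float *C)
{
    for (size_t ii = 0; ii < N; ii += TILE)
        for (size_t kk = 0; kk < N; kk += TILE)
            for (size_t jj = 0; jj < N; jj += TILE)
                for (size_t i = ii; i < ii + TILE; i++)
                    for (size_t k = kk; k < kk + TILE; k++) {
                        float a = A[i*N + k];   /* reused across the whole j loop */
                        for (size_t j = jj; j < jj + TILE; j++)
                            C[i*N + j] += a * B[k*N + j];
                    }
}
```

With TILE = 64, the three float tiles total 3 × 64 × 64 × 4 B = 48 KiB, small enough to stay resident in L2 while being reused, which is the whole point of the transformation.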

So well-optimized code that touches a lot of memory can often get enough work done as it's looping that it doesn't actually bottleneck on DRAM bandwidth (L3 cache misses) most of the time.

If a single channel of DRAM is enough to keep up with hardware prefetch requests for how quickly/slowly the code is actually touching new memory, there won't be any measurable slowdown from memory bandwidth. (Of course that's not always possible, and sometimes you do loop over a big array doing not very much work, or even just copying it, but if that only makes up a small fraction of the total run time then it's still not significant.)
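For contrast, here's a minimal sketch (my example, not from the answer) of that "not very much work" case: roughly one FP add per 4 bytes loaded, so once the array is much larger than L3, the loop is limited by DRAM bandwidth rather than by the FP units:

```c
#include <stddef.h>

/* ~1 add per 4 bytes loaded: far less work per byte than the core can
 * overlap with the memory traffic, so for arrays much larger than L3
 * this runs at DRAM speed. This is the kind of loop that would notice
 * fewer memory channels, especially with many threads running it at once. */
float sum_array(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}
```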

Peter Cordes

The theory is that every doubling of the system's memory channels doubles the bandwidth of memory access, so in theory there should be a performance gain; however, in real-world applications the gains are negligible. Why?

Think of it as a hierarchy, like "CPU <-> L1 cache <-> L2 cache <-> L3 cache <-> RAM <-> swap space". RAM bandwidth only matters when the L3 cache isn't big enough (and swap-space bandwidth only matters when RAM isn't big enough, and ...).

For most (not all) real world applications, the cache is big enough, so RAM bandwidth isn't important and the gains (of multi-channel) are negligible.

My postulation is that when the NT kernel allocates physical memory it is not distributing the allocations evenly across the memory channels.

It doesn't work like that. The CPU mostly only works with whole cache lines (e.g. 64-byte pieces); with one channel the entire cache line comes from that channel, and with 2 channels half of the cache line comes from one channel and the other half comes from the other. There is almost nothing that any software can do that will make any difference. The NT kernel only works with whole pages (e.g. 4 KiB pieces), so whatever the kernel does is even less likely to matter (until you start thinking about NUMA optimizations, which is a completely different thing).
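To make the interleaving concrete, here's a hypothetical sketch. Controllers differ in granularity (some split each line across channels as described above, others interleave whole lines), and which physical-address bit selects the channel is implementation-defined, but either way the mapping is fixed in hardware far below page granularity, so a 4 KiB page is spread across all channels no matter where the kernel places it:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical line-granularity interleave: with 64-byte cache lines and
 * 2 channels, one low physical-address bit (bit 6 here; the real bit is
 * implementation-defined) selects the channel. */
static unsigned channel_of(uint64_t phys_addr, unsigned channels)
{
    return (unsigned)((phys_addr >> 6) % channels);  /* >> 6 = cache-line index */
}

int main(void)
{
    /* Consecutive cache lines alternate channels, so any 4 KiB page
     * (64 lines) already uses both channels evenly; the kernel's choice
     * of physical page can't steer a process onto a single channel. */
    for (uint64_t addr = 0; addr < 8 * 64; addr += 64)
        printf("line at phys 0x%03llx -> channel %u\n",
               (unsigned long long)addr, channel_of(addr, 2));
    return 0;
}
```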

Brendan
    I think that it's not that most real world applications are not bandwidth bound, it's that certain classes of workloads are more memory bandwidth bound than others. For example, HPC workloads (such as numerous SPEC CPU benchmarks) are well known to be very sensitive to the memory bandwidth. We don't know what type of workloads the OP is talking about. Interleaving across channels is configurable and the way the OS allocates physical memory can have a significant impact on performance, at the potential cost of additional energy consumption. – Hadi Brais Feb 13 '19 at 12:20
    I searched for "benchmarks that test the benefits of systems with Multiple Memory Channel Architectures" - top 2 results were mostly focused towards gamers. HPC is a much more complex niche. – Brendan Feb 13 '19 at 12:47
  • @Brendan Yes, the authors seem to be concerned with gaming, although some of the benchmark tests simulate image processing. My curiosity relates to imaging. The scenario I have in mind is batch image processing of a large quantity (>100) of image files (RAW >30MB each). There may be 8 cores with 16 threads each performing a task such as upsizing an image by interpolation. Each image file is larger than the total cache available to the entire processor. Are you saying it's statistically unlikely that cores need to load from RAM simultaneously thus perf gain of extra bandwidth is negligible? – Duncan Gravill Feb 13 '19 at 15:05
  • @Brendan Thanks for your response, I think my confusion is clearing. Is the parallel processing of the images perhaps not processing of multiple images in parallel but the parallelization of the processing of a single image, to make best use of the cache, then processing the images serially? If each core was processing a different image in parallel then surely the use of the cache would be very inefficient. – Duncan Gravill Feb 13 '19 at 15:15