
On modern multi-core platforms, the parallel performance of memory-bandwidth-bound applications often does not scale with the number of cores. Typically, speedup is observed up to some core count, after which performance saturates. A synthetic example is the well-known STREAM benchmark, which is often used to report the achievable memory bandwidth, i.e., the memory bandwidth at the saturation point.

Consider the following results of the STREAM benchmark (Triad) on a single Xeon E5-2680 with a peak memory bandwidth of 42.7 GB/s (four channels of DDR3-1333):

1  core  16 GB/s
2  cores 30 GB/s
3+ cores 36 GB/s

STREAM scales well from 1 to 2 cores, but from 3 cores onward the performance stays roughly constant at 36 GB/s.
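For context, the Triad kernel behind these numbers is essentially the following (a minimal sketch, not the official STREAM source; the array size is a placeholder that just needs to exceed the last-level cache):

```c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 20000000L   /* ~160 MB per array, far larger than any cache */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    const double scalar = 3.0;

    /* Initialization also faults the pages in before timing. */
    #pragma omp parallel for
    for (long i = 0; i < N; i++) {
        a[i] = 1.0; b[i] = 2.0; c[i] = 0.0;
    }

    double t = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = b[i] + scalar * c[i];   /* the Triad kernel */
    t = omp_get_wtime() - t;

    /* Triad touches three 8-byte arrays per element: two reads plus one write. */
    printf("Triad: %.1f GB/s\n", 3.0 * N * sizeof(double) / t / 1e9);

    free(a); free(b); free(c);
    return 0;
}
```

Compiled with something like `gcc -O2 -fopenmp`, running it with different `OMP_NUM_THREADS` values reproduces the scaling curve above.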

My question is: what determines the memory bandwidth achievable by a single CPU core? Since that question is definitely too broad, I will narrow it down to the architecture mentioned above: how can I predict that single-threaded STREAM will reach 16 GB/s, e.g., from the specs of the E5-2680, from hardware performance counters, etc.?

angainor
  • I wonder how/if the number of memory channels affects the speed. My CPU with 4 memory channels scales up to 4 threads in a bandwidth benchmark; you see some improvement above two threads, and I think your machine has 4 channels as well. How does a 2-channel system fare? – avl_sweden Jul 26 '18 at 16:23
  • Related re low bandwidth with 1 core: [Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?](https://stackoverflow.com/q/39260020) - yes, big Xeons have high memory latency and a single core doesn't have enough memory parallelism to saturate DRAM. – Peter Cordes May 23 '22 at 14:32

1 Answer


For a single core, the major factors are the CPU frequency and the microarchitecture: how quickly the core can issue requests to the memory subsystem, and how well it can predict which memory locations you are going to access next. CPU designers go to great lengths to make things appear faster than they really are and to hide the effect of latencies. If the memory accesses are random and the code execution depends on the data, you will have to factor in the full memory access latency; whereas if you only stream through a bunch of data and, say, add it up, you will see something close to the full bandwidth. But for a single core, the absolute ceiling is set by the clock speed.
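The random-access versus streaming distinction above can be made concrete. In the following sketch (hypothetical helper functions, not from the question), `streaming_sum` is bandwidth-bound because its loads are independent and the prefetchers can run ahead, while `pointer_chase` is latency-bound because every load depends on the result of the previous one:

```c
#include <stddef.h>

/* Bandwidth-bound: independent accesses; hardware prefetch and
   out-of-order execution keep many cache-line requests in flight. */
double streaming_sum(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Latency-bound: each load depends on the previous one, so the core
   waits out (nearly) the full DRAM latency per element once the
   working set exceeds the caches. */
size_t pointer_chase(const size_t *next, size_t start, size_t steps)
{
    size_t i = start;
    while (steps--)
        i = next[i];   /* serialized dependent loads */
    return i;
}
```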

For multi-threaded access, the bottleneck will be the memory subsystem as a whole: the bus, the memory controller, and the RAM configuration on the motherboard. So it will depend on your platform. You can have DRAM that is 50% slower but four channels of it in parallel, and still come out faster overall. Or vice versa.
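As a concrete check of the channel arithmetic on the platform in the question: 4 channels × 1333 MT/s × 8 bytes per transfer = 42.7 GB/s, which is exactly the peak figure quoted above. The 36 GB/s STREAM plateau is therefore about 84% of the theoretical peak, a fairly typical efficiency for this kind of benchmark.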

The question, however, is very broad. If you want to know more about memory from a programmer's perspective, have a look at Ulrich Drepper's *What Every Programmer Should Know About Memory*, which has an in-depth description of the various factors.

It's a VERY in-depth topic.

PS: as for prediction, it is not really possible, or at least not practical. Unless you have access to very detailed specs of the CPU, chipset, motherboard and RAM, any prediction is only an educated guess. You are better off measuring it in real life, under your particular workload.

Martin
  • For a single core, the ceiling should theoretically be the memory bandwidth. Take the Intel Haswell i7-4770 CPU as an example: the L1 cache load bandwidth is 64 bytes/cycle and the frequency is 3.6 GHz, so the peak load throughput of a single core is 64 × 3.6 = 230.4 GB/s, far larger than the memory bandwidth. – user334026 Nov 16 '16 at 11:51
  • Yes, this hints at there being more factors than just clock speed. – avl_sweden Jul 26 '18 at 16:20
  • @user334026: A single core only has a limited number of LFB (Line Fill Buffers) to track requests for incoming cache lines. The L2<->L3 connection ("superqueue") has a few more, like maybe 16 requests, but that's not enough to keep more than 16GB/s of requests in flight on the OP's CPU, given the higher memory latency of big Xeon systems compared to desktop CPUs. [Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?](https://stackoverflow.com/q/39260020) - "client" chips typically have better single-core B/W than a Xeon, nearly maxing out their DRAM. – Peter Cordes May 23 '22 at 14:34
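A back-of-the-envelope check of the comment above, via Little's Law (bandwidth ≈ outstanding requests × line size / latency): with roughly 16 outstanding 64-byte line requests and an assumed ~65 ns DRAM latency for this class of Xeon (the latency figure is an assumption, not measured here), a single core tops out around 16 × 64 B / 65 ns ≈ 15.8 GB/s, consistent with the 16 GB/s measured in the question.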