
This question concerns the interpretation of STREAM Triad results on an Intel Xeon E5-2650v4 processor. This processor has 2 sockets with 12 cores each. The shared L3 cache on each socket is 30 MB, i.e. 30/12 = 2.5 MB/core. Following the STREAM sizing rule (each array at least 4x the aggregate last-level cache), the array size in the OpenMP version of the STREAM benchmark = 4 * ((30+30) * 1024 * 1024)/8 = 31,457,280 double elements, which is approximately 32,000,000 (32 million) double elements. I use Intel icc 17.0.1 with flags -O3 -xHost and vary the number of threads from 1 to 24. I obtained the following graph for the Stream Triad (figure: Stream Triad for Xeon E5-2650v4); a simplified sketch of the timed kernel is included after the questions. My questions are:

  1. The maximum bandwidth obtained is around 114 GB/sec, but this is more than the theoretical maximum bandwidth of 76 GB/sec (see Maximum Theoretical Bandwidth of E5-2650v4). How is this possible?

  2. Why is there a variation in bandwidth between even and odd numbers of threads beyond 10 threads?

  3. The maximum bandwidth is obtained at 12 threads (i.e. 114 GB/sec). Is this because OpenMP automatically spawns 6 threads on one socket and 6 threads on the other?

  4. STREAM reports that it needs around 734.2 MB of memory for this run. The RAM of one socket is enough to satisfy this, so will the memory be allocated on socket 1 or socket 2 (it surely cannot be split up, as the arrays are contiguous)? Further, if the memory is allocated on a single socket, won't threads on the other socket pay a penalty for accessing non-local memory?
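For reference, the kernel being timed is essentially the following. This is a simplified sketch only: the real stream.c statically allocates the arrays, repeats each kernel several times and validates the results; the 32,000,000-element sizing from above is assumed here.

```c
/* Simplified STREAM Triad sketch (the real stream.c uses static arrays,
 * repeats each kernel NTIMES and validates the results; omitted here). */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 32000000UL   /* ~4x the combined 60 MB of L3, in 8-byte doubles */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    const double scalar = 3.0;

    /* Parallel first-touch initialisation, as stream.c also does. */
    #pragma omp parallel for
    for (size_t j = 0; j < N; j++) {
        a[j] = 1.0; b[j] = 2.0; c[j] = 0.5;
    }

    double t = omp_get_wtime();
    #pragma omp parallel for
    for (size_t j = 0; j < N; j++)
        a[j] = b[j] + scalar * c[j];            /* the Triad kernel */
    t = omp_get_wtime() - t;

    /* STREAM counts 3 arrays x 8 bytes per iteration (2 reads + 1 write). */
    printf("Triad: %.1f MB/s\n", 3.0 * sizeof(double) * N / t / 1.0e6);

    free(a); free(b); free(c);
    return 0;
}
```

STREAM converts the timing into a bandwidth by counting 24 bytes of traffic per loop iteration (two 8-byte reads plus one 8-byte write), which is how the GB/s values in the plot are derived.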

I will be grateful for any suggestions/solutions. Thanks.

Gaurav Saxena
  • 2. The slowest thread determines the overall run time. Once the memory channel of a socket is saturated, adding more threads does not increase the bandwidth, therefore the run time of the threads running there increases. It takes approximately the same amount of time to finish the benchmark with 13 (6+7) and with 14 (7+7) threads, but in the latter case the data volume is larger, hence the higher bandwidth (14/13 times higher). The difference decreases with the increasing total number of threads as `1/n`. – Hristo Iliev May 15 '17 at 07:11
  • @HristoIliev: I agree with the per-socket bandwidth saturation concept. One doubt: doesn't the total data volume remain constant? Because whether I divide an array of 32,000,000 elements between 13 threads or 14 threads, the total volume of data transferred should remain equal. (My array size is fixed at 32,000,000 elements even when the number of threads increases - I hope this is correct.) – Gaurav Saxena May 15 '17 at 08:51
  • You can calculate the max throughput yourself (the arithmetic is written out as a small sketch after this comment thread). Let's say you have DDR4@2400GHz. The max = (8*2400)*num_channels*num_sockets. So for two sockets and quad channel that's 153.6 GB/s, which is twice the number Intel quotes because they only quote it for a single socket. You can find out your memory type with `sudo dmidecode -t memory`. You can get better than DDR4@2400 GHz now, so the max Intel quotes is obsolete even for single socket. – Z boson May 15 '17 at 09:16
  • 114/153.6 = 74% of the peak bandwidth. That's pretty good actually (assuming you have DDR4@2400 GHz). – Z boson May 15 '17 at 09:18
  • The source code is incredibly simple. You can see it here https://www.cs.virginia.edu/stream/FTP/Code/stream.c – Z boson May 15 '17 at 09:20
  • @Zboson: Thank you. I have DDR4@2.4GHz (I think you have a typo in DDR4@2400**GHz**). So memory bandwidth = 8 bytes * 2400 MHz * 4 channels * 2 sockets = 153.6 GB/sec, so I am getting (114/153.6) * 100 ≈ 74% of the peak bandwidth. The source code you mentioned is v5.10; I ran the OpenMP version v5.9. Hopefully they'll be similar! (I don't have administrative rights, so I couldn't run `sudo dmidecode -t memory`.) Thank you for letting me know that Intel quotes it for only one socket (I guess they call a socket a processor, whereas we call a processor a socket). Thanks again! – Gaurav Saxena May 15 '17 at 10:13
  • True, my bad, haven't looked at the source code of the benchmark in a while. The difference probably comes from the scheduler moving the odd thread between the sockets. – Hristo Iliev May 15 '17 at 13:53
  • @GauravSaxena, I think it's worthwhile to write your own triad benchmark. I started writing one for an answer but it's only getting about 50% of the max bandwidth on my system (DDR4 dual channel), I'm not sure why, and I have too many other things to do. But it should be pretty easy to modify the code in my answer [here](http://stackoverflow.com/a/42972674/2542702) for the triad function. Then I would play with how the memory is allocated. – Z boson May 15 '17 at 14:24
  • @Zboson: Thanks for the pointer to the code. I will try my best to understand the code and modify it. – Gaurav Saxena May 17 '17 at 14:38
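For completeness, the peak-bandwidth arithmetic discussed in the comments above can be written out as a small sketch. The DDR4-2400 transfer rate, 4 channels per socket and 2 sockets are the values assumed in this thread; substitute whatever `dmidecode -t memory` reports on your machine.

```c
/* Theoretical peak DRAM bandwidth from the values assumed in this thread:
 * DDR4-2400 (2400 MT/s), 4 memory channels per socket, 2 sockets. */
#include <stdio.h>

int main(void)
{
    const double transfer_rate  = 2400e6; /* transfers per second (MT/s)   */
    const double bytes_per_xfer = 8.0;    /* 64-bit wide channel = 8 bytes */
    const int    channels       = 4;      /* per socket                    */
    const int    sockets        = 2;

    double peak     = transfer_rate * bytes_per_xfer * channels * sockets;
    double measured = 114e9;              /* best Triad result in the plot */

    printf("peak = %.1f GB/s, measured/peak = %.0f%%\n",
           peak / 1e9, 100.0 * measured / peak);
    return 0;
}
```

With these assumptions the peak is 153.6 GB/s, so the measured 114 GB/s corresponds to roughly 74% efficiency, in line with the comments above.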

1 Answer

  1. Two sockets have twice the theoretical memory bandwidth of a single socket; the 76 GB/sec figure you cite is per socket. You should also mention the memory configuration (DIMM speed and number of populated channels) for a more specific analysis of achieved vs. theoretical bandwidth.
  2. If you distribute an odd number of threads across two sockets, the sockets never get the same number of threads, and hence not the same amount of work, which causes the inefficiency.
  3. That is implementation and OS specific. By default, the Intel OpenMP runtime "will create 12 threads, running freely on the logical processors provided by the operating system". The numbers clearly indicate that the OS spreads the threads across both sockets. For reproducibility and clarity, you should manually control thread affinity for measurements, e.g. using KMP_AFFINITY (Intel specific) or OMP_PROC_BIND (generic).
  4. Where the memory is allocated does not necessarily determine its NUMA placement. Most commonly, physical memory is only allocated during the first access (first touch), and that is when it is decided on which NUMA node a page is placed. This is also configurable / system specific. STREAM uses statically allocated memory anyway, but initializes it in an `omp parallel for`, such that each page is first touched by the thread that will be reading/writing it later on in another `omp parallel for` (therefore allocating it on that thread's NUMA node). So the arrays are, in fact, split between the sockets at page granularity; a minimal sketch of this first-touch pattern is shown below.
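To make points 3 and 4 concrete, here is a minimal first-touch sketch. The affinity settings in the leading comment are examples, not the only valid choices; the pattern mirrors what stream.c already does with its statically allocated arrays.

```c
/* First-touch NUMA placement sketch for a 2-socket run.  Launch pinned, e.g.
 *   OMP_PROC_BIND=spread OMP_PLACES=cores ./a.out    (portable OpenMP 4.0)
 *   KMP_AFFINITY=scatter ./a.out                     (Intel runtime)
 * These settings are examples; the point is only that threads must not
 * migrate between sockets once the pages have been placed. */
#include <stddef.h>

#define N 32000000UL

static double a[N], b[N], c[N];   /* statically allocated, as in stream.c */

int main(void)
{
    /* Linux places each page on the NUMA node of the thread that touches it
     * first, so initialise with the same parallel static schedule that the
     * timed kernels use: every thread's chunk lands on its own socket. */
    #pragma omp parallel for schedule(static)
    for (size_t j = 0; j < N; j++) {
        a[j] = 1.0; b[j] = 2.0; c[j] = 0.5;
    }

    /* The timed kernels follow with the identical schedule, so each thread
     * streams through pages that are local to its socket. */
    #pragma omp parallel for schedule(static)
    for (size_t j = 0; j < N; j++)
        a[j] = b[j] + 3.0 * c[j];

    return a[0] > 0.0 ? 0 : 1;    /* use the result so it isn't elided */
}
```

With threads pinned and a consistent static schedule, each thread streams mostly through pages resident on its own socket, so both memory controllers contribute and the cross-socket penalty raised in question 4 is largely avoided.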
Zulan
  • Besides what Zulan said, pay attention to the rules about boosting the problem size to exceed the aggregate last-level cache. – tim18 May 14 '17 at 16:54
  • Relevant to point 3: newer Linux kernels (3.10+) include a NUMA balancing mechanism that migrates memory pages between NUMA nodes, bringing each page closer to the thread that accesses it the most. Therefore, binding threads to cores is an absolute must. – Hristo Iliev May 15 '17 at 06:51
  • @tim18: Apologies but I did not get what exactly is your point. – Gaurav Saxena May 15 '17 at 08:29
  • @tim18: When I vary threads from 1 to 24, I keep the array size fixed at 32,000,000 elements (sized according to the two 30 MB LLCs). I hope this is correct. – Gaurav Saxena May 15 '17 at 08:54