This question concerns the interpretation of STREAM Triad results on an Intel Xeon E5-2650 v4 system. The machine has 2 sockets with 12 cores each. The shared L3 cache on each socket is 30 MB, i.e. 30/12 = 2.5 MB/core. Following the STREAM guideline that each array should be at least 4 times the total cache size, the array size in the OpenMP version of the STREAM benchmark is 4 * ((30+30) * 1024 * 1024)/8 = 31,457,280 double elements, which is approximately 32,000,000 (32 million) double elements. I use Intel icc 17.0.1 with the flags -O3 -xHost and vary the number of threads from 1 to 24. I obtained the following graph for the STREAM Triad: (graph of Triad bandwidth versus number of threads). My questions are:
1. The maximum bandwidth obtained is around 114 GB/sec, but this is more than the theoretical maximum bandwidth of 76 GB/sec (see Maximum Theoretical Bandwidth of E5-2650v4). How is this possible?
2. Why does the bandwidth vary between even and odd numbers of threads beyond 10 threads?
3. The maximum bandwidth is obtained at 12 threads (i.e. 114 GB/sec). Is this because OpenMP automatically spawns 6 threads on one socket and 6 threads on the other?
4. STREAM reports that it needs around 734.2 MB of memory for this run. The RAM attached to one socket is enough to satisfy this. So will the memory be allocated on socket 1 or socket 2 (surely it cannot be split up, as the arrays are contiguous)? Further, if the memory is allocated on a single socket, will the threads on the other socket not have to access non-local memory and pay a penalty?
I will be grateful for any suggestions/solutions. Thanks.