
I'm testing the memory bandwidth on a desktop and a server.

Skylake desktop: 4 cores / 8 hardware threads
Skylake server: Xeon 8168, dual socket, 48 cores (24 per socket) / 96 hardware threads

The peak bandwidth of the system is

Peak bandwidth desktop = 2 channels * 8 bytes * 2400 MT/s = 38.4 GB/s
Peak bandwidth server  = 2 sockets * 6 channels * 8 bytes * 2666 MT/s = 255.94 GB/s

I'm using my own triad function from STREAM to measure the bandwidth (full code below):

void triad(double *a, double *b, double *c, double scalar, size_t n) {
  #pragma omp parallel for
  for(size_t i=0; i<n; i++) a[i] = b[i] + scalar*c[i];
}

Here are results I get

Threads            Bandwidth (GB/s)
(desktop/server)   Desktop   Server
1 / 1                   28       16
2 / 24                  29      146
4 / 48                  25      177
8 / 96                  24      189

For 1 thread I don't understand why the desktop is so much faster than the server. According to this answer https://stackoverflow.com/a/18159503/2542702 SSE is sufficient to get the full bandwidth of a dual-channel system. That's what I observe on the desktop: two threads help only slightly, and 4 and 8 threads give a worse result. But on the server the single-threaded bandwidth is much lower. Why is this?

On the server I get the best results using 96 threads. I would have thought it would be saturated with far fewer threads. Why are so many threads necessary to saturate the bandwidth on the server? There is a large margin of error in my results and I don't include an error estimate. I took the best result of several runs.

The code

//gcc -O3 -march=native triad.c -fopenmp
//gcc -O3 -march=skylake-avx512 -mprefer-vector-width=512 triad.c -fopenmp
#include <stdio.h>
#include <omp.h>
#include <x86intrin.h>

void triad_init(double *a, double *b, double *c, double k, size_t n) {
  #pragma omp parallel for
  for(size_t i=0; i<n; i++) a[i] = k, b[i] = k, c[i] = k;
}

void triad(double *a, double *b, double *c, double scalar, size_t n) {
  #pragma omp parallel for
  for(size_t i=0; i<n; i++) a[i] = b[i] + scalar*c[i];
}

void triad_stream(double *a, double *b, double *c, double scalar, size_t n) {
#if defined(__AVX512F__)
  __m512d scalarv = _mm512_set1_pd(scalar);
  #pragma omp parallel for
  for(size_t i=0; i<n/8; i++) {
    __m512d bv = _mm512_load_pd(&b[8*i]), cv = _mm512_load_pd(&c[8*i]);
    _mm512_stream_pd(&a[8*i], _mm512_add_pd(bv, _mm512_mul_pd(scalarv, cv)));
  }        
#else
  __m256d scalarv = _mm256_set1_pd(scalar);
  #pragma omp parallel for
  for(size_t i=0; i<n/4; i++) {
    __m256d bv = _mm256_load_pd(&b[4*i]), cv = _mm256_load_pd(&c[4*i]);
    _mm256_stream_pd(&a[4*i], _mm256_add_pd(bv, _mm256_mul_pd(scalarv, cv)));
  }        
#endif
}

int main(void) {
  size_t n = 1LL << 31LL; 
  double *a = _mm_malloc(sizeof *a * n, 64), *b = _mm_malloc(sizeof *b * n, 64), *c = _mm_malloc(sizeof *c * n, 64);
  //double peak_bw = 2*8*2400*1E-3; // 2-channels * 8-bytes/transfer * 2400 MT/s (desktop)
  double peak_bw = 2*6*8*2666*1E-3; // 2-sockets * 6-channels * 8-bytes/transfer * 2666 MT/s (server)
  double dtime, mem, bw;
  printf("peak bandwidth %.2f GB/s\n", peak_bw);

  triad_init(a, b, c, 3.14159, n);
  dtime = -omp_get_wtime();
  triad(a, b, c, 3.14159, n);  
  dtime += omp_get_wtime();
  mem = 4*sizeof(double)*n*1E-9, bw = mem/dtime; // 2 reads + 1 RFO read + 1 write
  printf("triad:       %3.2f GB, %3.2f s, %8.2f GB/s, bw/peak_bw %8.2f %%\n", mem, dtime, bw, 100*bw/peak_bw);

  triad_init(a, b, c, 3.14159, n);
  dtime = -omp_get_wtime();
  triad_stream(a, b, c, 3.14159, n);  
  dtime += omp_get_wtime();
  mem = 3*sizeof(double)*n*1E-9, bw = mem/dtime; // 2 reads + 1 streaming write (no RFO)
  printf("triads:      %3.2f GB, %3.2f s, %8.2f GB/s, bw/peak_bw %8.2f %%\n", mem, dtime, bw, 100*bw/peak_bw);
}
    A big difference between the Skylake server and the Skylake desktop processor is the interconnect between the cores. The desktop processor has a ring-bus interconnect, while the server processor has a mesh network between the cores. Broadwell server CPUs also had a ring bus, but that solution isn't very scalable for higher core counts. Indeed, the advantage of Skylake-SP's mesh network is great scalability, but the single-thread memory bandwidth is very disappointing. – wim Jun 28 '19 at 10:19
  • 5
    See also [this description of the Skylake-SP memory subsystem](https://www.anandtech.com/show/11544/intel-skylake-ep-vs-amd-epyc-7000-cpu-battle-of-the-decade/5), and the [test results](https://www.anandtech.com/show/11544/intel-skylake-ep-vs-amd-epyc-7000-cpu-battle-of-the-decade/12), which confirm that the single-thread memory bandwidth is low. – wim Jun 28 '19 at 10:36
  • 2
    @Zboson: Thanks for the compliment. TBH, I don't think I'm enough of an expert on memory subsystems to give a definitive answer. I understand that a mesh interconnect is more scalable than a ring bus, but why couldn't they design a mesh interconnect with at least the same single-thread DRAM memory bandwidth as a Broadwell server CPU? Would that have cost too much silicon, or too much power (heat)? I can only guess. With AVX-512 you actually want more bandwidth than with AVX2, not less. – wim Jun 28 '19 at 13:07
  • 3
    ....(continued) Note that limited single thread bandwidth makes it easier to produce scalable computing results. (Which reminds me of the well known paper [Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers](https://crd-legacy.lbl.gov/~dhbailey/dhbpapers/twelve-ways.pdf) and [Goerg Hager's Blog, Fooling the masses](https://blogs.fau.de/hager/archives/category/fooling-the-masses).) – wim Jun 28 '19 at 13:08
  • @wim: I think if we were comparing Broadwell vs. Broadwell-EP we'd see *similar* results, except that I think BDW-EP would plateau and saturate its ring bus with far fewer threads, instead of (I assume) failing to saturate the memory controllers even with max threads. And yeah, SKX's mesh seems like a bad design mistake for single-threaded performance especially with lower core-count i9 HEDT models; there must be some reason for it. Maybe they thought they'd be able to do better eventually, but didn't manage that before release date. I don't understand the details either. – Peter Cordes Jun 28 '19 at 20:06
  • @PeterCordes: Yes, that's right. But with SKL versus SKL-SP, the ratio between single thread bandwidth and max threads bandwidth is a bit more pronounced maybe. – wim Jun 29 '19 at 13:31
  • 1
    @wim: well yeah, because single-threaded bandwidth is even more crap on SKX than on BDW-EP. And unlike BDW, that includes L3 bandwidth, unfortunately (according to Mysticial). But my point was that the shape of the curve would be different with BDW-EP: quick climb to a plateau, and maybe a decline like ZBoson sees on SKL-client. Instead of almost asymptotic approach to a max. – Peter Cordes Jun 29 '19 at 17:49
  • 1
    @wim, I forgot that the Skylake Server interconnect is based on the KNL's. I see more or less the same effect in the HBM (MCDRAM) on the KNL (i.e. low single-thread bandwidth, many threads needed to get full bandwidth). In case someone is wondering why I get 16 GB/s and Anandtech got 12 GB/s: I assume the write (to `a`) requires a read as well, so I count 3 reads and 1 write, whereas STREAM defaults to 2 reads and 1 write (`3/4*16 = 12`). 3 reads seems more appropriate to me unless non-temporal stores are used. – Z boson Jul 01 '19 at 06:30
  • https://en.wikichip.org/wiki/intel/mesh_interconnect_architecture – Z boson Jul 01 '19 at 06:49
  • Have you tried these measurements with "Sub NUMA Clustering" (formerly "Cluster on Die")? This can tell you something about the bottleneck. Unfortunately my SKL BIOS doesn't seem to support this. – Zulan Jul 01 '19 at 12:17
    A stupid question follows: are the cores running at the same/similar frequency? Desktop SKUs generally have higher clocks than server processors (not sure about throttling while running SSE code). Also, have you checked your performance counters for backend stalls? It is possible (though unlikely) that the latency due to the mesh is causing a full ROB. – TSG Jul 24 '19 at 10:05
  • @Zulan, no, but that is a good idea. I looked into Sub NUMA clustering on the KNL but not on SkylakeX yet. – Z boson Aug 01 '19 at 11:58
  • Could you post the results you get with SIMD/SSE implementations? Does that help single core to saturate memory in desktop and workstation CPUs? – DragonSpit Aug 15 '19 at 11:05
    @DragonSpit, I won't be posting any new results anytime soon (working on other things now). I did post the code using streaming SIMD stores, so you can test it yourself if you like. – Z boson Aug 16 '19 at 08:24
  • 1
    It is worth noting that many server installations include two sockets instead of one, therefore from a single thread you may access both near and far memory, so the increased access latency in the latter may kill the overall throughput of the execution. IIRC modern libc allocator is aware of the locality, but it's worth checking it via tools like `hwloc` or similar. – Jorge Bellon Sep 17 '19 at 08:13
  • http://www.eecg.toronto.edu/~enright/Kannan_MICRO48.pdf – Z boson Jan 24 '20 at 08:30
  • 1
    Higher thread counts are often necessary to deal with threads blocked on i/o including paging. The additional threads occupy the CPU when the other threads are blocked. If a task has 50% blocking, it takes 2 to busy a CPU. – David G. Pickett Jul 02 '20 at 20:03
  • 1
    Bandwidth can also be decreased by io activity in RAM, closing DRAM rows supporting a CPU activity. One early computer ran better with memory banked not interlaced because the IO was done in upper memory, leaving the CPU alone in lower memory! With many CPUs and DMA IO, the combinations are overwhelming, and thank goodness for multi-level cache. Speaking of cache, writes in many architectures are more expensive than reads. – David G. Pickett Jul 02 '20 at 20:03
  • I -1'd this question, because for an experienced [SO] user you did very little research on google. This question is asked in 2019 and there are several papers from 2017-2018 that actually not only answer this question there are tools available to do so. – Ahmed Masud Sep 25 '20 at 18:55
  • @AhmedMasud well what's the answer then, why is the mesh interconnect memory bandwidth worse than the ring bus for a single hypethread? – Lewis Kelsey Feb 26 '21 at 22:58

1 Answer


The hardware prefetcher is tuned differently on server CPUs than on workstation CPUs. Servers are expected to handle many threads, so the prefetcher requests smaller chunks from RAM. Here is a paper that goes into detail about the issue you're experiencing, but from the other side of the coin:

Hardware Prefetcher Aggressiveness Controllers: Do We Need Them All the Time?
