
I recently started working with a Xeon Phi Knights Landing (KNL) 7250 computer (http://ark.intel.com/products/94035/Intel-Xeon-Phi-Processor-7250-16GB-1_40-GHz-68-core).

This has 68 cores and AVX-512. The base frequency is 1.4 GHz and the turbo frequency is 1.6 GHz. I don't know what the turbo frequency is for all cores, because it is usually quoted for only one core.

Each Knights Landing core can do two 8-wide double-precision FMA operations per cycle. Since each FMA counts as two floating-point operations, that is 2*8*2 = 32 double-precision FLOPs per cycle per core.

Therefore the theoretical peak is 32*68*1.4 = 3046.4 DP GFLOPS.

For a single core at turbo the peak is 32*1.6 = 51.2 DP GFLOPS.
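As a sanity check on this arithmetic, here is a minimal sketch; the constants are just the figures quoted above, nothing else is assumed.

#include <stdio.h>

int main(void)
{
    /* 2 FMA units x 8 doubles per vector x 2 FLOPs per FMA = 32 FLOPs/cycle */
    const double flops_per_cycle = 2.0 * 8.0 * 2.0;
    const int cores = 68;
    const double base_ghz = 1.4, turbo_ghz = 1.6;

    printf("All-core peak:    %.1f DP GFLOPS\n", flops_per_cycle * cores * base_ghz);
    printf("Single-core peak: %.1f DP GFLOPS\n", flops_per_cycle * turbo_ghz);
    return 0;
}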

Dense matrix multiplication is one of the few operations that is actually capable of getting close to the peak FLOPS. The Intel MKL library provides optimized dense matrix multiplication functions. On a Sandy Bridge system I obtained better than 97% of the peak FLOPS with DGEMM. On Haswell I got about 90% of the peak when I checked a few years ago, so it was clearly more difficult to reach the peak with FMA at the time. However, with Knights Landing and MKL I get much less than 50% of the peak.

I modified the dgemm_example.c file in the MKL examples directory to calculate the GFLOPS using 2.0*1E-9*m*n*k/time (see the code below).

I have also tried `export KMP_AFFINITY=scatter` and `export OMP_NUM_THREADS=68`, but neither appears to make a difference. However, `KMP_AFFINITY=compact` is significantly slower, and so is `OMP_NUM_THREADS=1`, so the default thread topology appears to be scattered anyway and threading is working.
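To double-check what the runtime actually picks up, a small diagnostic like the following can be used (this is my own sketch, not part of the MKL examples; `KMP_AFFINITY=verbose` also prints the actual thread bindings at startup):

#include <stdio.h>
#include "mkl.h"
#include "omp.h"

int main(void)
{
    /* Both should report 68 if the environment variables took effect */
    printf("omp_get_max_threads() = %d\n", omp_get_max_threads());
    printf("mkl_get_max_threads() = %d\n", mkl_get_max_threads());
    return 0;
}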

The best GFLOPS I have seen is about 1301 GFLOPS, which is about 43% of the peak. For one thread I have seen 38 GFLOPS, which is about 74% of the peak. This tells me that MKL DGEMM is optimized for AVX-512; otherwise a single thread could not exceed 50% of a peak computed assuming 512-bit FMAs. On the other hand, for a single thread I think I should get 90% of the peak.
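For reference, the single-thread number can also be pinned down programmatically rather than with `OMP_NUM_THREADS=1`. A sketch of a drop-in for the timed section of the full example below, using its variables (`mkl_set_num_threads` is MKL's documented threading call):

mkl_set_num_threads(1);  /* force MKL to a single thread for this call */
dtime = -omp_get_wtime();
cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
            m, n, k, alpha, A, k, B, n, beta, C, n);
dtime += omp_get_wtime();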

The KNL memory can operate in three modes (cache, flat, and hybrid), which can be set from the BIOS (http://www.anandtech.com/show/9794/a-few-notes-on-intels-knights-landing-and-mcdram-modes-from-sc15). I don't know what mode my (or rather my work's) KNL system is in. Could this have an impact on DGEMM?
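If the system turns out to be in flat mode, one option would be to place the matrices in MCDRAM explicitly. Here is a hedged sketch using the memkind library's hbwmalloc interface, with the matrix dimensions from the example below; this assumes libmemkind is installed (link with -lmemkind), and hbw_posix_memalign/hbw_free are that library's API, not MKL's:

#include <hbwmalloc.h>  /* memkind high-bandwidth-memory allocator */

double *A = NULL;
/* 64-byte-aligned allocation in MCDRAM; returns nonzero on failure */
if (hbw_posix_memalign((void **)&A, 64, (size_t)m * k * sizeof(double)) != 0) {
    A = (double *)mkl_malloc(m * k * sizeof(double), 64);  /* fall back to DDR */
}
/* ... run dgemm as usual, then hbw_free(A) (or mkl_free for the fallback) ... */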

My question is: why is the FLOPS from DGEMM so low, and what can I do to improve it? Maybe I have not configured MKL optimally (I am using ICC 17.0).

source /opt/intel/mkl/bin/mklvars.sh intel64
icc -O3 -mkl dgemm_example.c

Here is the code:

#define min(x,y) (((x) < (y)) ? (x) : (y))

#include <stdio.h>
#include <stdlib.h>
#include "mkl.h"
#include "omp.h"

int main()
{
    double *A, *B, *C;
    int m, n, k, i, j;
    double alpha, beta;

    printf ("\n This example computes real matrix C=alpha*A*B+beta*C using \n"
            " Intel(R) MKL function dgemm, where A, B, and  C are matrices and \n"
            " alpha and beta are double precision scalars\n\n");

    m = 30000, k = 30000, n = 30000;
    printf (" Initializing data for matrix multiplication C=A*B for matrix \n"
            " A(%ix%i) and matrix B(%ix%i)\n\n", m, k, k, n);
    alpha = 1.0; beta = 0.0;

    printf (" Allocating memory for matrices aligned on 64-byte boundary for better \n"
            " performance \n\n");
    A = (double *)mkl_malloc( m*k*sizeof( double ), 64 );
    B = (double *)mkl_malloc( k*n*sizeof( double ), 64 );
    C = (double *)mkl_malloc( m*n*sizeof( double ), 64 );
    if (A == NULL || B == NULL || C == NULL) {
      printf( "\n ERROR: Can't allocate memory for matrices. Aborting... \n\n");
      mkl_free(A);
      mkl_free(B);
      mkl_free(C);
      return 1;
    }

    printf (" Intializing matrix data \n\n");
    for (i = 0; i < (m*k); i++) {
        A[i] = (double)(i+1);
    }

    for (i = 0; i < (k*n); i++) {
        B[i] = (double)(-i-1);
    }

    for (i = 0; i < (m*n); i++) {
        C[i] = 0.0;
    }

    printf (" Computing matrix product using Intel(R) MKL dgemm function via CBLAS interface \n\n");
    double dtime;
    dtime = -omp_get_wtime();

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 
                m, n, k, alpha, A, k, B, n, beta, C, n);
    dtime += omp_get_wtime();
    printf ("\n Computations completed.\n\n");
    printf ("time %f\n", dtime);
    printf ("GFLOPS %f\n", 2.0*1E-9*m*n*k/dtime);

    printf (" Top left corner of matrix A: \n");
    for (i=0; i<min(m,6); i++) {
      for (j=0; j<min(k,6); j++) {
        printf ("%12.0f", A[j+i*k]);
      }
      printf ("\n");
    }

    printf ("\n Top left corner of matrix B: \n");
    for (i=0; i<min(k,6); i++) {
      for (j=0; j<min(n,6); j++) {
        printf ("%12.0f", B[j+i*n]);
      }
      printf ("\n");
    }

    printf ("\n Top left corner of matrix C: \n");
    for (i=0; i<min(m,6); i++) {
      for (j=0; j<min(n,6); j++) {
        printf ("%12.5G", C[j+i*n]);
      }
      printf ("\n");
    }

    printf ("\n Deallocating memory \n\n");
    mkl_free(A);
    mkl_free(B);
    mkl_free(C);

    printf (" Example completed. \n\n");
    return 0;
}
Z boson
  • Personally I'd have asked this on one of the Intel fora. – High Performance Mark Dec 23 '16 at 09:56
  • @HighPerformanceMark, can you suggest an Intel forum? I rely on SO for everything. – Z boson Dec 23 '16 at 10:16
  • It's Christmas, here's my present to you https://software.intel.com/en-us/forums/intel-many-integrated-core – High Performance Mark Dec 23 '16 at 10:46
  • @HighPerformanceMark, thank you, that appears to be a good resource for me. If I get a useful answer from that forum I will put the results here. – Z boson Dec 23 '16 at 12:37
  • Agree about asking on the Intel forum, if you can't find Intel or Colfax pages about benchmarking dgemm on KNL. For a start, yes, you should try the fast memory mode numactl setting. – tim18 Dec 23 '16 at 13:01
  • FWIW I get ~1850 GFLOPS at turbo and ~1760 at 1300 MHz with a cache-only configuration on a 7210, which has slightly fewer cores and a lower frequency. You can get the MCDRAM configuration with `numactl -H`. I strongly assume that FLAT would provide the best performance, but Intel highly tuned it for cache as well. – Zulan Dec 23 '16 at 13:02
  • @Zulan thanks for the info about `numactl -H` (it turns out that is in the HBM modes image from AnandTech that I linked). It shows 8 nodes. This must mean it's in FLAT mode. However, the nodes are not each assigned the same amount of memory: nodes 0-3 get 24576 MB each and nodes 4-7 get 4096 MB each. That seems odd. I am surprised you get such a large GFLOPS. Did you run the same code I listed? What do you mean by at turbo? Do you mean you turn turbo on/off from the BIOS? I don't know any other way to control the turbo. – Z boson Dec 23 '16 at 13:32
  • I am not sure MCDRAM is even necessary for GEMM. DRAM should be sufficient. I think MCDRAM is mostly for memory bandwidth bound algorithms. – Z boson Dec 23 '16 at 13:34
  • I think this NUMA configuration means that you are running in [SNC-4 sub-NUMA clustering mode](https://colfaxresearch.com/knl-numa/) in conjunction with four separate NUMA domains AND explicitly allocatable MCDRAM. Try changing the clustering mode; that may be even more important than the MCDRAM mode. Also, memory is very important for GEMM: it is only the blocking strategies that allow GEMM to benefit from data reuse / caches. So having explicitly allocatable fast memory instead of a transparent cache allows for a more controlled optimization. – Zulan Dec 23 '16 at 13:45
  • Another strange thing is that only nodes 0-3 have CPUs assigned to them. – Z boson Dec 23 '16 at 13:47
  • I do control turbo by selecting "1301" MHz with the Linux userspace governor. – Zulan Dec 23 '16 at 13:47
  • I did `sudo hwloc-dump-hwdata` and it says `Cluster mode: SNC-4`. – Z boson Dec 23 '16 at 13:51
  • Do you have the chance to change the BIOS settings of this system? In general the clustered NUMA modes don't seem to perform that well. – Zulan Dec 23 '16 at 14:39
  • SNC-4 mode should optimize performance of applications running 4 MPI ranks each using OpenMP across its set of cores. I haven't seen a discussion on how dgemm performance would compare with a single rank using all cores; such a mode isn't used to advertise Intel products. – tim18 Dec 23 '16 at 17:00
  • @Zulan, I managed to get 1891 GFLOPS with `numactl -m 4,5,6,7 ./a.out`. That's 62% of the peak assuming the frequency is 1.4 GHz, or 54% of the peak assuming the frequency is 1.6 GHz. Do you know what the turbo frequency for all cores is? In any case this is much better but still a long way from the peak. `numactl -m 4,5,6,7 ./a.out` runs only on the fast MCDRAM. This limits the matrix size to a bit over 20000x20000, which is fine for testing. I can ask the HPC guys about changing the BIOS next week. – Z boson Dec 23 '16 at 20:03
  • It's a bit more complicated even. KNL seems to have AVX frequencies, meaning that it cannot run at the nominal frequency with AVX-heavy code, but then again it has an AVX turbo for that. Basically you get an undetermined clock rate for any setting above the nominal AVX frequency. Now, strangely, [this page](https://www.aspsys.com/solutions/hpc-accelerators/intel-xeon-phi/) only lists AVX base frequencies for the other two SKUs - I'm not sure what that means. But basically, if you want to know FLOPS/cycle, you have to measure cycles. Given the significant initialization time, you have to modify the code. – Zulan Dec 23 '16 at 20:23
  • Also consider that the turbo frequency may vary for longer running programs or temperature or .... – Zulan Dec 23 '16 at 20:25
  • I don't know about AVX frequencies (or have forgotten about them). What exactly is this? This link http://www.realworldtech.com/forum/?threadid=159525&curpostid=159551 says the AVX frequency of my 7250 KNL is only 1.2 GHz. That would make the peak 32*68*1.2 = 2611.2 GFLOPS, and my best DGEMM result is 72% of that. But the page you mentioned selling KNL says the frequency for all cores of my 7250 is 1.5 GHz. – Z boson Dec 23 '16 at 20:59
  • @Zulan "Frequency listed is nominal (non-AVX) TDP frequency. For all-tile turbo frequency, add 100 MHz. For > single-tile turbo frequency, add 200 MHz. For high-AVX instruction frequency, subtract 200 MHz" – Z boson Dec 23 '16 at 21:00
  • @Zulan when did this AVX frequency start? I mean, with what microarchitecture? I seem to recall something about this now on SO. It's probably in Agner Fog's manuals. I'll check. – Z boson Dec 23 '16 at 21:10
  • @Zulan found a post by you about AVX frequencies (I had already upvoted it) http://stackoverflow.com/questions/35041597/performance-degradation-of-matrix-multiplication-of-single-vs-double-precision-a/35398899#35398899 – Z boson Dec 23 '16 at 21:16
  • @Zulan, sorry to belabor this but here is another interesting link about AVX frequencies http://www.anandtech.com/show/8423/intel-xeon-e5-version-3-up-to-18-haswell-ep-cores-/5 and I think it may be related to this discussion http://www.agner.org/optimize/blog/read.php?i=142#378 – Z boson Dec 23 '16 at 21:32
  • They got the AVX frequencies for KNL from here http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-processor-overview.html – Z boson Dec 23 '16 at 21:37
  • @Zulan, so I measured the frequency using my technique described here http://stackoverflow.com/a/25400230/2542702. I changed the latency for `_mm512_add_ps` to six based on Agner Fog's measurements of KNL. I get 1.6 GHz for a single core and 1.5 GHz for all cores. It appears that my KNL 7250 is not affected by the AVX frequency, which would explain why it's not listed for the 7250. I guess it's possible that the frequency downgrading only happens when the pipeline is full to get full throughput, and not when it's bound by latency, which is what I use to measure the frequency. – Z boson Dec 24 '16 at 12:49
  • I think there is a good chance that you are right, but I'd be careful with assumptions of a fixed latency for a given instruction. I would rather recommend measuring the clock rate using perf or PAPI. – Zulan Dec 24 '16 at 19:14
  • Is 1 DP GFLOPS == 2 SP GFLOPS? – huseyin tugrul buyukisik Feb 22 '17 at 18:25
  • @huseyintugrulbuyukisik, for x86 cores, yes. DP SIMD operations are in general twice as slow as SP SIMD (with some exceptions such as `sqrt`). For graphics processors DP is often much slower. This includes Intel HD Graphics, where as far as I know DP is 4 times slower than SP. On AMD and Intel it's as bad as 32 times slower. For scalar operations SP and DP are the same speed on x86 (with the same exceptions such as `sqrt`). – Z boson Feb 22 '17 at 20:28
  • @Zboson what if mixed precision is used? – huseyin tugrul buyukisik Feb 22 '17 at 20:48
  • Did you try using 2*68 threads? Each core has 2 VPUs, so in order to saturate them fully, you need to use 2 SMT threads. – Martin Ueding Nov 27 '17 at 20:23
  • @MartinUeding, I think I tried 68, 2*68, and 4*68 threads, but it's been a while. – Z boson Nov 28 '17 at 11:13

0 Answers