
My formula for estimating the maximum FLOPs/s of an Intel CPU is

Max SP FLOPs/s = frequency * 4 (SSE; 8 with AVX) * 2 (MAC) * number of cores (not HW threads)
Max DP FLOPs/s = 0.5 * Max SP FLOPs/s

By MAC I mean the CPU can do one SSE (or AVX) multiplication and one addition at the same time. On the system I'm using, the maximum frequency under load is 2.66 GHz, it only has SSE, and the number of cores (not hardware threads) is 4. That gives Max SP FLOPs/s = 2.66 * 4 * 2 * 4 = 85.12 GFLOPs/s.
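Written out as a small program, the estimate looks like this (a minimal sketch using the numbers above; swap in your own frequency, SIMD width, and core count):

#include <cstdio>

int main() {
    // Numbers for my system: 2.66 GHz under load, SSE only, 4 physical cores.
    const double freq_ghz = 2.66;
    const int sp_per_vec  = 4;   // 4 floats per SSE register (8 with AVX)
    const int mac         = 2;   // one multiply + one add per cycle
    const int cores       = 4;   // physical cores, not hardware threads

    const double max_sp = freq_ghz * sp_per_vec * mac * cores;  // GFLOPs/s
    const double max_dp = 0.5 * max_sp;
    std::printf("Max SP: %.2f GFLOPs/s, Max DP: %.2f GFLOPs/s\n", max_sp, max_dp);
    return 0;
}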

The number of FLOPs for matrix multiplication is approximately 2*n*m*k. For a square matrix with n=1000 that's 2*10^9 FLOPs (2 billion FLOPs). Once I know the time I can estimate the FLOPs/s.
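Here is roughly how the measured number is computed (a minimal sketch; gemm() is just a placeholder for whatever multiply routine is being timed, whether my code, Eigen, or MKL):

#include <chrono>
#include <functional>

// Time one call to a matrix multiply and report GFLOPs/s, counting 2*n*m*k FLOPs.
double measured_gflops(const std::function<void()> &gemm, int n, int m, int k) {
    const auto t0 = std::chrono::high_resolution_clock::now();
    gemm();  // the multiply under test
    const auto t1 = std::chrono::high_resolution_clock::now();
    const double seconds = std::chrono::duration<double>(t1 - t0).count();
    const double flops = 2.0 * n * m * k;  // ~2*n*m*k FLOPs for C = A*B
    return flops / seconds / 1e9;
}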

However, the best I can get with my own code is about 40 SP GFLOPs/s, for example with n=1000. I get about the same result with Eigen. That's about 45% efficiency. Is my calculation for the maximum wrong? What's the best efficiency for an Intel CPU for large dense matrix multiplication? Does anyone have a paper describing this?

I know that on the GPU the efficiency can be more than 60%. http://www.anandtech.com/show/6774/nvidias-geforce-gtx-titan-part-2-titans-performance-unveiled/3

Edit: I get similar results for n=500, which easily fits in the 12 MB L3 cache of my system, so the cache does not seem to be the limiting factor (though maybe I can use it more efficiently).

Edit2: Eigen's benchmarks show it as good as MKL (for SSE). They use an Intel(R) Core(TM)2 Quad CPU Q9400 @ 2.66 GHz, so 2.66 * 2 (DP SSE) * 2 (MAC) * 4 cores = 42.56 DP GFLOPs/s. You can see on the plot that they are all getting less than 20, something on the order of 45% like me. http://eigen.tuxfamily.org/index.php?title=Benchmark

http://ark.intel.com/products/35365/Intel-Core2-Quad-Processor-Q9400-6M-Cache-2_66-GHz-1333-MHz-FSB

Edit3: Here is my code for anyone who cares. I can get slightly better results than this, but not much better. I'm using Agner Fog's vectorclass for SSE/AVX, with Vec8f aliased to float8 and Vec4d to double4.

// SGEMM with AVX: call GEMM_tile<float, float8>(nthreads, a, b, c, n, m, k);
template <typename ftype, typename floatn>
void GEMM_tile(const int nthreads, const ftype*A , const ftype* B, ftype* C, const int N, const int M, const int K) {       
    for(int i=0; i<N; i++) {
       for(int j=0; j<K; j++) {
           C[K*i + j] = 0;
       }
    }   
    const int nc = 32;
    const int kc = 32;
    const int mc = 32;
    omp_set_num_threads(nthreads);
    #pragma omp parallel for if(nthreads>1)
    for(int ii=0; ii<N; ii+=nc) {
        for(int jj=0; jj<K; jj+=kc)
            for(int ll=0; ll<M; ll+=mc) {
                const int nb = min(N-ii, nc);
                const int kb = min(K-jj, kc);
                const int mb = min(M-ll, mc);
                MM_block<ftype, floatn>(nb, mb, kb, &A[M*ii+ll], M, &B[K*ll+jj], K, &C[K*ii+jj], K );
            }
     }
}

template <typename ftype, typename floatn>
void MM_block(int n, int m, int k, const ftype *a, const int stridea, 
                                   const ftype *b, const int strideb,
                                   ftype *c, const int stridec ) {
    const int vec_size = sizeof(floatn)/sizeof(ftype);
    // Process the block as 4 x vec_size tiles of C.
    for(int i=0; i<n; i+=4) {
        for(int j=0; j<k; j+=vec_size) {
            Dot4x4_vec_block<ftype, floatn>(m, &a[stridea*i], &b[j], &c[stridec*i + j], stridea, strideb, stridec);
        }
    }
}

template <typename ftype, typename floatn>
inline void Dot4x4_vec_block(const int n, const ftype *a, const ftype *b, ftype *c, const int stridea, const int strideb, const int stridec) {
    // Accumulators hold a 4 x vec_size tile of C.
    floatn tmp0, tmp1, tmp2, tmp3;
    load(tmp0, &c[stridec*0]);
    load(tmp1, &c[stridec*1]);
    load(tmp2, &c[stridec*2]);
    load(tmp3, &c[stridec*3]);

    // Walk along four consecutive rows of a; each scalar element is broadcast to a full vector below.
    const ftype *a0_ptr = &a[stridea*0];
    const ftype *a1_ptr = &a[stridea*1];
    const ftype *a2_ptr = &a[stridea*2];
    const ftype *a3_ptr = &a[stridea*3];
    for(int i=0; i<n; i++) {
        floatn breg = floatn().load(&b[i*strideb + 0]);

        floatn areg0 = *a0_ptr++;
        floatn areg1 = *a1_ptr++;
        floatn areg2 = *a2_ptr++;
        floatn areg3 = *a3_ptr++;

        tmp0 += areg0 * breg;
        tmp1 += areg1 * breg;
        tmp2 += areg2 * breg;
        tmp3 += areg3 * breg;
    }
    tmp0.store(&c[stridec*0]);
    tmp1.store(&c[stridec*1]);
    tmp2.store(&c[stridec*2]);
    tmp3.store(&c[stridec*3]);
}
  • I'll write more later, but good BLAS implementations asymptotically achieve ~90% of peak on modern x86 hardware. MKL and GotoBLAS are both good reference points. Pushing beyond about 60% requires carefully-tuned inner loops with consideration given to instruction latency and port utilization. It often requires writing assembly, unless you happen to hit on an intrinsic idiom that your compiler handles really well. – Stephen Canon Apr 16 '13 at 16:58
  • Thanks, 90% is what I was hoping for. I added some text on Eigen vs MKL. They claim it's as good as MKL (for SSE). The interesting thing is my formula would predict over 40 DP GFLOPs/s and they all get less than 20. Maybe my formula is off by a factor of 2? –  Apr 16 '13 at 17:44
  • I found some code to profile MKL. They have a chart for the i7-2600K. They get 90 DP GFLOPs/s. I get 50 DP GFLOPs/s on my 2600K at home. That tells me my formula is correct and that my code (and Eigen's) is inefficient. http://software.intel.com/en-us/articles/a-simple-example-to-measure-the-performance-of-an-intel-mkl-function –  Apr 16 '13 at 19:16

1 Answer


Often, the limiting factor for processing throughput is memory bandwidth, especially in cases where your working set doesn't fit into the CPU cache (your 1000-by-1000 matrix of float will take up ~4 MB, whereas your CPU probably has a 2 MB L3 cache). This is a situation where the structure of your algorithm can make a big difference in how it performs, but you will usually hit a wall at some point where you just can't get any faster because you're waiting on values to come from some higher level in the memory hierarchy.
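As a rough back-of-the-envelope check of that working-set argument (an illustrative sketch; the 2 MB figure is just the cache size assumed above, so substitute your CPU's actual last-level cache size):

#include <cstdio>

int main() {
    const int n = 1000;
    const double cache_mb = 2.0;  // assumed last-level cache size from the paragraph above
    const double matrix_mb = n * static_cast<double>(n) * sizeof(float) / (1024.0 * 1024.0);
    std::printf("One %dx%d float matrix: %.1f MB; A, B and C together: %.1f MB; cache: %.1f MB\n",
                n, n, matrix_mb, 3.0 * matrix_mb, cache_mb);
    return 0;
}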

In addition, your theoretical numbers assume that you have sufficient instructions without data dependencies to keep all of the execution units tasked on every cycle. This can be very difficult to do in practice. I'm not sure what the optimum throughput for a general matrix multiply would be, but check out this previous question for information on what you can do to maximize the instruction throughput.
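As a concrete scalar illustration of the data-dependency point (an illustrative sketch, not taken from the question's kernel): a reduction with a single accumulator stalls on the latency of each add, while splitting the sum across independent accumulators lets the adds overlap. This is the same idea as the tmp0..tmp3 accumulators in the question's 4x4 block.

// Single dependency chain: every add must wait for the previous one to finish.
float dot_single(const float *x, const float *y, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}

// Four independent chains: several adds can be in flight at the same time.
float dot_four_chains(const float *x, const float *y, int n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i = 0;
    for (; i + 3 < n; i += 4) {
        s0 += x[i + 0] * y[i + 0];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; i++)  // remainder
        s0 += x[i] * y[i];
    return (s0 + s1) + (s2 + s3);
}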

– Jason R
  • I'm using loop tiling/blocking to deal with the cache. I have heard of using multiple levels of blocking for L1, L2, and L3; I'm not doing that yet. I use loop unrolling to deal with data dependencies. Anyway, my results are about as good as Eigen's, which I assume is the state of the art for SSE (not AVX). So if my code is inefficient then so is Eigen's. –  Apr 16 '13 at 14:33
  • It doesn't seem to me that your code is significantly less efficient than what one could expect to achieve. It sounds like you've done the right things to make sure your algorithm is as memory-friendly as possible. – Jason R Apr 16 '13 at 15:04
  • A properly-tiled matrix multiply algorithm is *never* limited by memory bandwidth (on modern architectures, including x86). The limiting constraint on performance is raw compute throughput. – Stephen Canon Apr 16 '13 at 16:56
  • Thanks, that's what I thought. I think that's because matrix multiplication is an order n^3 operation, so it spends far more time processing data than reading data, so the limiting factor should be the processing. –  Apr 16 '13 at 17:49