6

I am working on parallel programming concepts and trying to optimize a matrix multiplication example on a single core. The fastest implementation I have come up with so far is the following:

/* This routine performs a dgemm operation
 *  C := C + A * B
 * where A, B, and C are lda-by-lda matrices stored in column-major format.
 * On exit, A and B maintain their input values. */    
void square_dgemm (int n, double* A, double* B, double* C)
{
  /* For each row i of A */
  for (int i = 0; i < n; ++i)
    /* For each column j of B */
    for (int j = 0; j < n; ++j) 
    {
      /* Compute C(i,j) */
      double cij = C[i+j*n];
      for( int k = 0; k < n; k++ )
          cij += A[i+k*n] * B[k+j*n];
      C[i+j*n] = cij;
    }
}

The results are shown below. How can I reduce the loops and increase the performance?

login4.stampede(72)$ tail -f job-naive.stdout
Size: 480       Mflop/s:  1818.89       Percentage: 18.95
Size: 511       Mflop/s:  2291.73       Percentage: 23.87
Size: 512       Mflop/s:  937.061       Percentage:  9.76
Size: 639       Mflop/s:  293.434       Percentage:  3.06
Size: 640       Mflop/s:  270.238       Percentage:  2.81
Size: 767       Mflop/s:  240.209       Percentage:  2.50
Size: 768       Mflop/s:  242.118       Percentage:  2.52
Size: 769       Mflop/s:  240.173       Percentage:  2.50
Average percentage of Peak = 22.0802
Grade = 33.1204
Peter Cordes
Chaithu Bablu
    The old concept of loop unrolling should be taken care of by compiler optimization today. You could declare `const int n` to signal to the compiler that the value of `n` will not change -- allowing potential further compiler optimization. Make sure you are compiling with full optimization, either `-Ofast` or `-O3` depending on your compiler. – David C. Rankin Feb 15 '16 at 04:36
    Apart from the fact that there are faster algorithms for multiplying matrices, your code as it stands is a little cache-heavy. There is no reason to stride through `A` and `C` when in fact you could stride through only `B`. I mean, swap the `i` and `j` loops. This might not give you heaps, but it should be more cache-friendly. You might even want to transpose `B` into a temporary copy so that _all_ N^3 iteration is cache-friendly. If you have access to Intel intrinsics, the more obvious solution is to vectorize your code. – paddy Feb 15 '16 at 04:39
    Before you start parallelizing something, you should figure out what the state of the art *is*, so that you can attempt something better and tell if you are succeeding. On a single processor, you can use hand-tuned standard libraries such as BLAS (Basic Linear Algebra Subprograms) https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms These are surprisingly good (including handling such complications as cache effects). Hand-coded loops by people not deeply familiar with the problem usually perform poorly in comparison, and that seems to be where you are starting. – Ira Baxter Feb 15 '16 at 05:14
    You can read here about how good BLAS is compared to simple hand-code loops: http://stackoverflow.com/questions/1303182/how-does-blas-get-such-extreme-performance – Ira Baxter Feb 15 '16 at 05:17
  • restricting pointers helps, so try c99 – user3528438 Feb 15 '16 at 05:41
  • Issue at 512 due to TLB with 4K pages. Further issues due to lack of L2/L3 blocking. See papers by van de Geijn for details on GEMM optimization. – Jeff Hammond Feb 15 '16 at 07:17
    @paddy is correct, just reorder your loops so that you are predominantly operating on rows at a time. Then you can use intrinsics like https://stackoverflow.com/questions/18499971/efficient-4x4-matrix-multiplication-c-vs-assembly – technosaurus Oct 04 '17 at 04:11
  • @ChaithuBablu: you can accept one of the answers by clicking on the grey checkmark below its score. – chqrlie Jul 15 '19 at 13:19

4 Answers

10

The state-of-the-art implementation of matrix multiplication on CPUs uses the GotoBLAS algorithm. Basically the loops are organized in the following order:

Loop5 for jc = 0 to N-1 in steps of NC
Loop4   for kc = 0 to K-1 in steps of KC
          //Pack KCxNC block of B
Loop3     for ic = 0 to M-1 in steps of MC
            //Pack MCxKC block of A
//--------------------Macro Kernel------------
Loop2       for jr = 0 to NC-1 in steps of NR
Loop1         for ir = 0 to MC-1 in steps of MR
//--------------------Micro Kernel------------
Loop0           for k = 0 to KC-1 in steps of 1
                //update MRxNR block of C matrix

A key insight underlying modern high-performance implementations of matrix multiplication is to organize the computation by partitioning the operands into blocks for temporal locality (the three outermost loops), and to pack (copy) those blocks into contiguous buffers that fit into the various levels of the memory hierarchy for spatial locality (the three innermost loops).

GotoBLAS algorithm

The above figure (originally from this paper, directly used in this tutorial) illustrates the GotoBLAS algorithm as implemented in BLIS. Cache blocking parameters {MC, NC, KC} determine the submatrix sizes of Bp (KC × NC) and Ai (MC × KC), such that they fit in the various caches. During the computation, row panels Bp are contiguously packed into a buffer to fit in the L3 cache. Blocks Ai are similarly packed into a buffer to fit in the L2 cache. Register block sizes {MR, NR} relate to submatrices held in registers that contribute to C. In the micro-kernel (the innermost loop), a small MR × NR micro-tile of C is updated by a pair of MR × KC and KC × NR slivers of Ai and Bp.
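To make the loop ordering above concrete, here is a minimal C sketch of just the three cache-blocking loops, with a plain triple loop standing in for the packed macro/micro-kernels. The block sizes are illustrative, not tuned for any particular cache, and packing is omitted; it is a sketch of the blocking idea, not of a full BLIS kernel:

```c
/* Sketch of the three outer cache-blocking loops (Loop5/Loop4/Loop3 above).
 * NC/KC/MC are illustrative block sizes, not tuned values. The packed
 * macro/micro-kernels are reduced to a plain triple loop over one block.
 * Column-major layout, as in the question. */
enum { NC = 64, KC = 64, MC = 64 };

static int min_int(int a, int b) { return a < b ? a : b; }

void square_dgemm_blocked(int n, const double *A, const double *B, double *C)
{
    for (int jc = 0; jc < n; jc += NC)          /* Loop5: block columns of C/B */
      for (int kc = 0; kc < n; kc += KC)        /* Loop4: block inner dimension */
        for (int ic = 0; ic < n; ic += MC) {    /* Loop3: block rows of C/A */
            int jmax = min_int(jc + NC, n);
            int kmax = min_int(kc + KC, n);
            int imax = min_int(ic + MC, n);
            /* "macro kernel": one MCxKC block of A times one KCxNC block
             * of B, updating the corresponding MCxNC block of C */
            for (int j = jc; j < jmax; ++j)
                for (int k = kc; k < kmax; ++k) {
                    double bkj = B[k + j*n];
                    for (int i = ic; i < imax; ++i)
                        C[i + j*n] += A[i + k*n] * bkj;
                }
        }
}
```

All iterations of the original triple loop are still performed exactly once; only their order changes, so each block of A and B is reused while it is hot in cache.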

For Strassen's algorithm with O(N^2.807) complexity, you might be interested in reading this paper. Other fast matrix multiplication algorithms with asymptotic complexity less than O(N^3) are extended in this paper. There is also a recent thesis about practical fast matrix multiplication algorithms.

The following tutorials might be helpful if you want to learn more about how to optimize matrix multiplication on CPUs:

How to Optimize GEMM Wiki

GEMM: From Pure C to SSE Optimized Micro Kernels

BLISlab: A sandbox for optimizing GEMM for CPU and ARM

The most up-to-date document about how to optimize GEMM on CPUs (with AVX2/FMA) step by step can be downloaded here: https://github.com/ULAFF/LAFF-On-HPC/blob/master/LAFF-On-PfHP.pdf

A Massive Open Online Course to be offered on edX starting in June 2019 (LAFF-On Programming for High Performance): https://github.com/ULAFF/LAFF-On-HPC http://www.cs.utexas.edu/users/flame/laff/pfhp/LAFF-On-PfHP.html

Jianyu Huang
3

My C is quite rusty, and I don't know which of the following the optimizer is already doing, but here goes...

Since virtually all the time is spent doing a dot product, let me just optimize that; you can build from there.

double* pa = &A[i];        /* walks A[i + k*n]: stride of n doubles */
double* pb = &B[j*n];      /* walks B[k + j*n]: stride of 1 double  */
double* pc = &C[i+j*n];
for( int k = 0; k < n; k++ )
{
    *pc += *pa * *pb++;
    pa += n;
}

Your code is probably spending more time on subscript arithmetic than anything else. My code uses +=8 and +=(n<<3), which is a lot more efficient. (Note: a double takes 8 bytes.)
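For context, here is roughly how that dot-product core might sit inside the full i/j loops (my wrapper, not part of the original snippet; whether it actually beats the compiler's own strength reduction of the subscripted version should be measured, not assumed):

```c
/* The pointer-striding dot product placed back inside the full i/j loops.
 * Column-major layout as in the question. The accumulator is kept in a
 * local so the store to C happens once per element. */
void square_dgemm_ptr(int n, const double *A, const double *B, double *C)
{
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i) {
            const double *pa = &A[i];      /* A[i + k*n], advances by n */
            const double *pb = &B[j*n];    /* B[k + j*n], advances by 1 */
            double cij = C[i + j*n];
            for (int k = 0; k < n; ++k) {
                cij += *pa * *pb++;
                pa += n;
            }
            C[i + j*n] = cij;
        }
}
```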

Other optimizations:

If you know the value of n, you could "unroll" at least the innermost loop. This eliminates the overhead of the for.

Even if you only knew that n was even, you could iterate n/2 times, doubling up on the code in each iteration. This would cut the for overhead in half (approx).

I did not check to see if the matrix multiply could be better done in row-major versus column-major order. +=8 is faster than +=(n<<3); this would be a small improvement in the outer loops.

Another way to "unroll" would be to do two dot-products in the same inner loop. (I guess I am getting too complex to even explain.)

CPUs are "superscalar" these days. This means that they can, to some extent, do multiple things at the same time. But it does not mean that things which must be done consecutively can be optimized that way. Doing two independent dot products in the same loop may provide more opportunities for superscalar execution.
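A minimal sketch of the "two independent dot products" idea, assuming for brevity that n is even: computing C(i,j) and C(i+1,j) in the same inner loop gives the CPU two independent accumulator chains whose multiply-adds can overlap in the pipeline:

```c
/* Two dot products per inner loop: update C(i,j) and C(i+1,j) together.
 * c0 and c1 are independent dependency chains, so their floating-point
 * operations can be issued in parallel on a superscalar core.
 * Assumes n is even; column-major layout as in the question. */
void square_dgemm_pair(int n, const double *A, const double *B, double *C)
{
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; i += 2) {
            double c0 = C[i     + j*n];
            double c1 = C[i + 1 + j*n];
            for (int k = 0; k < n; ++k) {
                double bkj = B[k + j*n];   /* shared operand, loaded once */
                c0 += A[i     + k*n] * bkj;
                c1 += A[i + 1 + k*n] * bkj;
            }
            C[i     + j*n] = c0;
            C[i + 1 + j*n] = c1;
        }
}
```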

Rick James
2

There are a lot of straightforward improvements. The basic optimization is what Rick James wrote. Furthermore, you can rearrange the first matrix by rows and the second one by columns. Then in your for() loops you will always do ++ and never do +=n. Loops where you jump by n are much slower than loops that step by 1.

But most of those optimizations don't pack much of a punch, because a good compiler will do them for you when you use the -O3 or -O4 flags. It will unroll the loops, reuse registers, do logical operations instead of multiplications, etc. It will even change the order of your for i and for j loops if necessary.

The core problem with your code is that for N×N matrices you use 3 nested loops, forcing you to do O(N^3) operations. This is very slow. I think that state-of-the-art algorithms do only ~O(N^2.37) operations (link here). For large matrices (say N = 5000) this is a hell of a strong optimization. You can implement the Strassen algorithm easily, which will give you a ~N^2.807 running time, or use it in combination with the Karatsuba algorithm, which can speed things up even for regular scalar optimizations. Don't implement anything on your own; download an open-source implementation. Matrix multiplication is a huge topic with a lot of research and very fast algorithms. Using 3 loops is not considered a valid way to do this work efficiently. Good luck
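As a sketch of the rearrangement suggested above (my illustration, not the answerer's code): copy one operand into a transposed buffer once, so that both operands are walked with stride 1 in the inner loop. The O(n^2) copy is amortized over the O(n^3) multiply. As Peter Cordes notes in the comment below, the compiler will not do this for you, since it has nowhere to store the temporary matrix:

```c
#include <stdlib.h>

/* Multiply with A copied into a transposed scratch buffer so that the
 * inner loop reads both operands contiguously. Column-major layout, as
 * in the question; At holds A transposed, so row i of A is contiguous. */
void square_dgemm_transA(int n, const double *A, const double *B, double *C)
{
    double *At = malloc((size_t)n * n * sizeof *At);
    if (!At) return;                    /* real code should report failure */
    for (int k = 0; k < n; ++k)
        for (int i = 0; i < n; ++i)
            At[k + i*n] = A[i + k*n];   /* O(n^2) one-time transpose */
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i) {
            double cij = C[i + j*n];
            for (int k = 0; k < n; ++k)
                cij += At[k + i*n] * B[k + j*n];   /* both stride-1 */
            C[i + j*n] = cij;
        }
    free(At);
}
```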

DanielHsH
  • Compilers won't transpose your array for you. They have nowhere to store the temporary matrix. Transpose is the big win here. – Peter Cordes Oct 03 '17 at 21:19
1

Instead of optimizing, you can obfuscate the code to make it look like it is optimized.

Here is a matrix multiplication with a single null-bodied for loop(!):

/* This routine performs a dgemm operation
 *  C := C + A * B
 * where A, B, and C are lda-by-lda matrices stored in column-major format.
 * On exit, A and B maintain their input values. 
 * This implementation uses a single for loop: it has been optimised for space,
 * namely vertical space in the source file! */    
void square_dgemm(int n, const double *A, const double *B, double *C) {
    for (int i = 0, j = 0, k = -1;
         ++k < n || ++j < n + (k = 0) || ++i < n + (j = 0);
         C[i+j*n] += A[i+k*n] * B[k+j*n]) {}
}
chqrlie