While playing with different ways of doing matrix multiplication and measuring the execution times, I ran into some questions about the results.
Matrices: float A[N][N], B[N][N], C[N][N];
By matrix read order:
IJK:
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        for (k = 0; k < N; k++)
            C[i][j] = C[i][j] + A[i][k] * B[k][j];
JKI:
for (j = 0; j < N; j++)
    for (k = 0; k < N; k++)
        for (i = 0; i < N; i++)
            C[i][j] = C[i][j] + A[i][k] * B[k][j];
KIJ:
for (k = 0; k < N; k++)
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            C[i][j] = C[i][j] + A[i][k] * B[k][j];
With an auxiliary variable:
IJK:
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        aux = C[i][j];
        for (k = 0; k < N; k++)
            aux = aux + A[i][k] * B[k][j];
        C[i][j] = aux;
    }
JKI:
for (j = 0; j < N; j++)
    for (k = 0; k < N; k++) {
        aux = B[k][j];
        for (i = 0; i < N; i++)
            C[i][j] = C[i][j] + A[i][k] * aux;
    }
KIJ:
for (k = 0; k < N; k++)
    for (i = 0; i < N; i++) {
        aux = A[i][k];
        for (j = 0; j < N; j++)
            C[i][j] = C[i][j] + aux * B[k][j];
    }
My Execution times:
What's going on in my head:
What stands out most is the jump in 'JKI' execution time above size 512. My guess is that beyond that size the matrices no longer fit in cache, and the problem with 'JKI' is that all its accesses run top-to-bottom (down a column) rather than left-to-right (along a row), the way C and A are accessed in 'IJK' and C and B in 'KIJ'. If it were the other way around, then whenever a block was loaded into cache the following accesses would be hits, and that's why 'IJK' and 'KIJ' are faster. Correct me if I'm wrong.
The thing I can't get over is why 'KIJ' is faster than 'IJK'. Shouldn't they be the same? [q1]
Why does using an auxiliary variable in 'JKI', which supposedly reduces memory accesses, give the same execution time as plain 'JKI'? [q2]
And last: why does the '-O2' flag reduce the execution time so drastically for 'KIJ' but leave 'JKI' practically unchanged? [q3] I would also like to know what -O2 actually does to optimize.
Final conclusion: 'KIJ' with the auxiliary variable is the way to go out of these six examples. I would love to know if there is a faster algorithm. [q4]