While playing with different ways of doing matrix multiplication and measuring the execution times, I ran into some questions about the results.
Matrices: float A[N][N], B[N][N], C[N][N];
By matrix read order:
IJK:
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        for (k = 0; k < N; k++)
            C[i][j] = C[i][j] + A[i][k] * B[k][j];
JKI:
for (j = 0; j < N; j++)
    for (k = 0; k < N; k++)
        for (i = 0; i < N; i++)
            C[i][j] = C[i][j] + A[i][k] * B[k][j];
KIJ:
for (k = 0; k < N; k++)
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            C[i][j] = C[i][j] + A[i][k] * B[k][j];
With an auxiliary variable:
IJK:
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        aux = C[i][j];
        for (k = 0; k < N; k++)
            aux = aux + A[i][k] * B[k][j];
        C[i][j] = aux;
    }
JKI:
for (j = 0; j < N; j++)
    for (k = 0; k < N; k++) {
        aux = B[k][j];
        for (i = 0; i < N; i++)
            C[i][j] = C[i][j] + A[i][k] * aux;
    }
KIJ:
for (k = 0; k < N; k++)
    for (i = 0; i < N; i++) {
        aux = A[i][k];
        for (j = 0; j < N; j++)
            C[i][j] = C[i][j] + aux * B[k][j];
    }
My Execution times:
What's going on in my head:
What stands out most is the jump in 'JKI' execution time above size 512. My guess is that beyond that size the matrices no longer fit in cache, and the problem with 'JKI' is that all its accesses run top-to-bottom (down a column) rather than left-to-right (along a row), the way C and A are accessed in 'IJK' and C and B in 'KIJ'. If it were the other way around, then whenever a block was loaded into cache the following accesses would be hits, and that's why 'IJK' and 'KIJ' are faster. Correct me if I'm wrong.
The thing I can't get over is why 'KIJ' is faster than 'IJK'. Shouldn't they be the same? [q1]
Why does using an auxiliary variable in 'JKI', which supposedly reduces memory accesses, give the same execution time as plain 'JKI'? [q2]
And last: why does the '-O2' flag reduce the execution time so drastically for 'KIJ' but leave 'JKI' practically unchanged? [q3] I would also like to know what -O2 actually does to optimize.
Final conclusion: 'KIJ' with the auxiliary variable is the way to go out of these six examples. I would love to know if there is a faster algorithm. [q4]