I have coded the following C function for multiplying two NxN matrices using tiling/blocking and AVX vectors to speed up the calculation. Right now though I'm getting a segmentation fault when I try to combine AVX intrinsics with tiling. Any idea why that happens?
Also, is there a better memory access pattern for matrix B? Maybe transposing it first or even changing the k and j loop? Because right now, I'm traversing it column-wise which is probably not very efficient in regards to spatial locality and cache lines.
1 void mmult(double A[SIZE_M][SIZE_N], double B[SIZE_N][SIZE_K], double C[SIZE_M][SIZE_K])
2 {
3 int i, j, k, i0, j0, k0;
4 // double sum;
5 __m256d sum;
6 for(i0 = 0; i0 < SIZE_M; i0 += BLOCKSIZE) {
7 for(k0 = 0; k0 < SIZE_N; k0 += BLOCKSIZE) {
8 for(j0 = 0; j0 < SIZE_K; j0 += BLOCKSIZE) {
9 for (i = i0; i < MIN(i0+BLOCKSIZE, SIZE_M); i++) {
10 for (j = j0; j < MIN(j0+BLOCKSIZE, SIZE_K); j++) {
11 // sum = C[i][j];
12 sum = _mm256_load_pd(&C[i][j]);
13 for (k = k0; k < MIN(k0+BLOCKSIZE, SIZE_N); k++) {
14 // sum += A[i][k] * B[k][j];
15 sum = _mm256_add_pd(sum, _mm256_mul_pd(_mm256_load_pd(&A[i][k]), _mm256_broadcast_sd(&B[k][j])));
16 }
17 // C[i][j] = sum;
18 _mm256_store_pd(&C[i][j], sum);
19 }
20 }
21 }
22 }
23 }
24 }