I want to speed up a matrix multiply algorithm. I am trying to use Intel's AVX SIMD intrinsics, but I am finding that I don't quite understand what they do.
For context, I transpose matrix B before the computation, so that each iteration of the inner loop can multiply four pairs of values,
A[i][k] * B[j+0][k]
A[i][k] * B[j+1][k]
A[i][k] * B[j+2][k]
A[i][k] * B[j+3][k]
accumulating into C[i][j] through C[i][j+3]. In theory, this should improve cache efficiency, since every load walks a row contiguously instead of striding down a column. The matrix multiply part of my code is as follows.
for(i = 0; i < n; i++)
{
    for(j = 0; j < n; j += VECTOR_WIDTH)  /* C[i][j..j+3] are produced together */
    {
        vc = _mm256_load_pd(&c[i][j]);
        vc_0 = _mm256_setzero_pd();  /* running partial sums for C[i][j+0] */
        vc_1 = _mm256_setzero_pd();  /* ... C[i][j+1] */
        vc_2 = _mm256_setzero_pd();  /* ... C[i][j+2] */
        vc_3 = _mm256_setzero_pd();  /* ... C[i][j+3] */
        for(k = 0; k < n; k += VECTOR_WIDTH)
        {
            va   = _mm256_load_pd(&a[i][k]);    /* A[i][k..k+3] */
            vb_0 = _mm256_load_pd(&b[j][k]);    /* rows of the transposed B */
            vb_1 = _mm256_load_pd(&b[j+1][k]);
            vb_2 = _mm256_load_pd(&b[j+2][k]);
            vb_3 = _mm256_load_pd(&b[j+3][k]);
            vc_0 = _mm256_add_pd(vc_0, _mm256_mul_pd(va, vb_0));
            vc_1 = _mm256_add_pd(vc_1, _mm256_mul_pd(va, vb_1));
            vc_2 = _mm256_add_pd(vc_2, _mm256_mul_pd(va, vb_2));
            vc_3 = _mm256_add_pd(vc_3, _mm256_mul_pd(va, vb_3));
        }
        /* here I need to reduce vc_0..vc_3 into vc -- this is my question */
        _mm256_store_pd(&c[i][j], vc);
    }
}
i, j, k are integers; a, b, c are double**; va, the vb_* and vc_* variables, and vc are all __m256d; VECTOR_WIDTH is 4 (the number of doubles in a __m256d). All matrices are n x n. I hope this is enough information, but please let me know if I need to add more.
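One thing I should note: _mm256_load_pd requires 32-byte-aligned addresses. Here is a minimal sketch of the kind of allocation I am assuming throughout (the use of _mm_malloc here is illustrative, not necessarily exactly what my code does):

#include <stdlib.h>
#include <immintrin.h>

#define VECTOR_WIDTH 4  /* doubles per __m256d */

/* Sketch: allocate an n x n matrix whose rows are 32-byte aligned,
   as _mm256_load_pd/_mm256_store_pd require. If the rows were only
   malloc-aligned, _mm256_loadu_pd would have to be used instead. */
double **alloc_matrix(int n)
{
    double **m = malloc(n * sizeof *m);
    for (int i = 0; i < n; i++)
        m[i] = _mm_malloc(n * sizeof(double), 32);
    return m;
}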
To my understanding, each load puts four consecutive elements of a row (positions k through k+3) into a vector; however, vc_0, vc_1, vc_2, and vc_3 each end up holding 4 partial products that, summed together, belong in a single element of C. Is there a way to get those four horizontal sums into C[i][j] through C[i][j+3], and what is the most efficient way to do so? Any and all help will be appreciated!
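For what it's worth, the best idea I have come up with so far is the pairwise hadd/permute/blend reduction sketched below, placed right after the k loop; this is based on my reading of the intrinsics guide, and I have no idea whether it is the most efficient option:

/* vc_0..vc_3 = [a0,a1,a2,a3], [b0..b3], [c0..c3], [d0..d3] */
__m256d t0    = _mm256_hadd_pd(vc_0, vc_1);           /* [a0+a1, b0+b1, a2+a3, b2+b3] */
__m256d t1    = _mm256_hadd_pd(vc_2, vc_3);           /* [c0+c1, d0+d1, c2+c3, d2+d3] */
__m256d swap  = _mm256_permute2f128_pd(t0, t1, 0x21); /* [a2+a3, b2+b3, c0+c1, d0+d1] */
__m256d blend = _mm256_blend_pd(t0, t1, 0xC);         /* [a0+a1, b0+b1, c2+c3, d2+d3] */
__m256d sums  = _mm256_add_pd(swap, blend);           /* [sum_a, sum_b, sum_c, sum_d] */
vc = _mm256_add_pd(vc, sums);  /* add the four dot products into C[i][j..j+3] */

Is this reasonable, or is there a cheaper sequence (or a way to restructure the loop so the horizontal sums disappear entirely)?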