
I want to speed up a matrix multiply algorithm. I am trying to use the Intel SIMD intrinsics, but I don't quite understand what they do.

For context, I transpose the matrix b before the computation, which should allow me to compute the four products

A[i][k] * B[j+0][k]
A[i][k] * B[j+1][k]
A[i][k] * B[j+2][k]
A[i][k] * B[j+3][k]

in each iteration, with the results going into C[i][j] through C[i][j+3]. In theory, this should improve cache efficiency by reusing blocks that are already in the cache. The matrix-multiply part of my code is as follows.

  for(i = 0; i < n; i++)
  {
     for(j = 0; j < n; j++)
     {
        vc = _mm256_load_pd(&c[i][j]);
        for(k = 0; k < n; k+=VECTOR_WIDTH)
        {
           /* four consecutive doubles from row i of a */
           va = _mm256_load_pd(&a[i][k]);

           /* four consecutive doubles from rows j..j+3 of the transposed b */
           vb_0 = _mm256_load_pd(&b[j][k]);
           vb_1 = _mm256_load_pd(&b[j+1][k]);
           vb_2 = _mm256_load_pd(&b[j+2][k]);
           vb_3 = _mm256_load_pd(&b[j+3][k]);

           /* element-wise products: each vector holds four partial products */
           vc_0 = _mm256_mul_pd(va,vb_0);
           vc_1 = _mm256_mul_pd(va,vb_1);
           vc_2 = _mm256_mul_pd(va,vb_2);
           vc_3 = _mm256_mul_pd(va,vb_3);

        }
        _mm256_store_pd(&c[i][j],vc);
     }
  }

i, j, k are integers; a, b, c are double**; va and the vb_*/vc_* variables are all __m256d. All matrices are n x n. I hope this is enough information, but please let me know if I should add more.

To my understanding, each load puts elements [m][n] through [m][n+3] of the source matrix into a vector; however, vc_0, vc_1, vc_2, and vc_3 each hold four partial products that, summed together, belong in just one element of matrix C. Is there a way to get those four sums into matrix C, and what is the most efficient way to do so? Any and all help will be appreciated!
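
For illustration, here is a minimal sketch of one way (not necessarily the most efficient) to reduce a `__m256d` of partial products to a scalar, assuming plain AVX. The helper name `hsum256_pd` and the accumulators `acc0`…`acc3` are hypothetical, not part of the original code:

    #include <immintrin.h>

    /* Hypothetical helper: sum the four doubles held in one __m256d. */
    static inline double hsum256_pd(__m256d v)
    {
       __m128d lo = _mm256_castpd256_pd128(v);    /* lower half: v0, v1 */
       __m128d hi = _mm256_extractf128_pd(v, 1);  /* upper half: v2, v3 */
       lo = _mm_add_pd(lo, hi);                   /* {v0+v2, v1+v3} */
       __m128d swapped = _mm_unpackhi_pd(lo, lo); /* {v1+v3, v1+v3} */
       return _mm_cvtsd_f64(_mm_add_sd(lo, swapped)); /* v0+v1+v2+v3 */
    }

Used with four running accumulators (initialized with `_mm256_setzero_pd()` before the k loop and updated with `_mm256_add_pd`, or `_mm256_fmadd_pd` where FMA is available), the j loop would step by 4 and finish with something like:

    c[i][j+0] += hsum256_pd(acc0);
    c[i][j+1] += hsum256_pd(acc1);
    c[i][j+2] += hsum256_pd(acc2);
    c[i][j+3] += hsum256_pd(acc3);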

  • Hi, how are `vc` and `vc_0, vc_1, vc_2, vc_3` related? – Mathieu Dec 07 '20 at 09:06
  • Could you post a [MCVE]: a complete C file (you can add a function to hard-code the input matrix values) and the command to build and run it? – Mathieu Dec 07 '20 at 09:06
  • Use `_mm256_loadu_pd` rather than `_mm256_load_pd`, as your source addresses are not guaranteed to be suitably aligned. (Ditto for the stores.) – Paul R Dec 07 '20 at 09:43
  • `double**` - yuck, don't do that. Allocate [proper 2D arrays](https://stackoverflow.com/questions/17389009/a-different-way-to-malloc-a-2d-array) or do manual 2D indexing in a 1D array. And align them by 128 if you want to make sure you're actually dealing with a pair of cache lines; that will also fix your segfaults from using alignment-required loads on unaligned memory. ([Segmentation fault when trying to use intrinsics specifically \_mm256\_storeu\_pd()](https://stackoverflow.com/q/39516528)) – Peter Cordes Dec 07 '20 at 13:28
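
Following up on the last two comments, here is a minimal sketch of a contiguous, aligned allocation, assuming C11 `aligned_alloc` is available. The helper name `alloc_matrix` and the 64-byte (one cache line) alignment are illustrative choices, not from the comments:

    #include <stdlib.h>

    /* Hypothetical helper: allocate an n x n matrix of doubles as one
       contiguous, 64-byte-aligned block, indexed as m[i*n + j]. */
    static double *alloc_matrix(size_t n)
    {
       size_t bytes = n * n * sizeof(double);
       /* aligned_alloc expects the size to be a multiple of the alignment */
       size_t padded = (bytes + 63) & ~(size_t)63;
       return aligned_alloc(64, padded);
    }

With this layout, `a[i*n + k]` replaces `a[i][k]`, and the block is released with `free()`. Note that individual rows stay 32-byte aligned only when n is a multiple of 4; otherwise `_mm256_loadu_pd` is the safe load, as suggested above.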
