1

I'm having some trouble vectorizing some C code using SSE vector instructions. The code I have to vectorize is:

#define N 1000
void matrix_mul(int mat1[N][N], int mat2[N][N], int result[N][N])
{
   int i, j, k;
   for (i = 0; i < N; ++i)
   {
      for (j = 0; j < N; ++j)
      {
         for (k = 0; k < N; ++k)
         {
              result[i][k] += mat1[i][j] * mat2[j][k];
         }
      }
   }
}

Here is what I got so far:

void  matrix_mul_sse(int mat1[N][N], int mat2[N][N], int result[N][N])
{
   int i, j, k; int* l;
   __m128i v1, v2, v3;
   v3 = _mm_setzero_si128();
   for (i = 0; i < N; ++i)
   {
       for (j = 0; j < N; j += 4)
       {

           for (k = 0; k < N; k += 4)
           {

               v1 = _mm_set1_epi32(mat1[i][j]);
               v2 = _mm_loadu_si128((__m128i*)&mat2[j][k]);
               v3 = _mm_add_epi32(v3, _mm_mul_epi32(v1, v2));
               _mm_storeu_si128((__m128i*)&result[i][k], v3);
               v3 = _mm_setzero_si128();
           }
       }
   }
}

After execution I get the wrong result. I know that the reason is the load from memory into v2. I loop through mat1 in row-major order, so I need to load mat2[0][0], mat2[1][0], mat2[2][0], mat2[3][0], ... but what is actually loaded is mat2[0][0], mat2[0][1], mat2[0][2], mat2[0][3], ... because mat2 is stored in memory in row-major order. I tried to fix this problem but without any improvement. Can anyone help me, please?
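To illustrate: a 16-byte load always grabs four consecutive ints from one row, while the column I want would have to be gathered element by element. (This fragment is purely illustrative; load_patterns is just a throwaway helper.)

#include <emmintrin.h> /* SSE2 */

static void load_patterns(int mat2[N][N], int j, int k)
{
    /* _mm_loadu_si128 fetches four consecutive ints from one row:
       { mat2[j][k], mat2[j][k+1], mat2[j][k+2], mat2[j][k+3] } */
    __m128i row = _mm_loadu_si128((const __m128i*)&mat2[j][k]);

    /* A column has to be gathered element by element; note that
       _mm_set_epi32 takes its arguments from the high lane down:
       { mat2[j][k], mat2[j+1][k], mat2[j+2][k], mat2[j+3][k] } */
    __m128i col = _mm_set_epi32(mat2[j+3][k], mat2[j+2][k],
                                mat2[j+1][k], mat2[j][k]);
    (void)row; (void)col;
}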

Majd.A.A
  • I wrote a bit about this on codereview last year: https://codereview.stackexchange.com/a/177751/36018 – harold Oct 30 '18 at 17:56

2 Answers

4

Below is your implementation, fixed:

void  matrix_mul_sse(int mat1[N][N], int mat2[N][N], int result[N][N])
{
   int i, j, k;
   __m128i v1, v2, v3, v4; 
   for (i = 0; i < N; ++i)
   {
       for (j = 0; j < N; ++j) // 'j' must be incremented by 1
       {
           // read mat1 here because it does not use 'k' index
           v1 = _mm_set1_epi32(mat1[i][j]); 
           for (k = 0; k < N; k += 4)
           {   
               v2 = _mm_loadu_si128((const __m128i*)&mat2[j][k]);

               // read what's in the result array first as we will need to add it later to our calculations
               v3 = _mm_loadu_si128((const __m128i*)&result[i][k]);

               // use _mm_mullo_epi32 here instead _mm_mul_epi32 and add it to the previous result
               v4 = _mm_add_epi32(v3, _mm_mullo_epi32(v1, v2));

               // store the result
               _mm_storeu_si128((__m128i*)&result[i][k], v4);
           }
       }
   }
}

In short, _mm_mullo_epi32 (which requires SSE4.1) produces 4 x int32 results, as opposed to _mm_mul_epi32, which produces 2 x int64 results. If you cannot use SSE4.1, then have a look at the answer here for an alternative SSE2 solution.

Full descriptions from the Intel Intrinsics Guide:

_mm_mullo_epi32: Multiply the packed 32-bit integers in a and b, producing intermediate 64-bit integers, and store the low 32 bits of the intermediate integers in dst.

_mm_mul_epi32: Multiply the low 32-bit integers from each packed 64-bit element in a and b, and store the signed 64-bit results in dst.
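To see the difference concretely, here is a minimal demo (the input values are arbitrary; compile with -msse4.1 or equivalent):

#include <stdio.h>
#include <smmintrin.h> /* SSE4.1 */

int main(void)
{
    __m128i a = _mm_setr_epi32(1, 2, 3, 4);
    __m128i b = _mm_setr_epi32(10, 20, 30, 40);

    int lo[4];
    _mm_storeu_si128((__m128i*)lo, _mm_mullo_epi32(a, b));
    printf("mullo: %d %d %d %d\n", lo[0], lo[1], lo[2], lo[3]);
    /* mullo: 10 40 90 160 -- four 32-bit products */

    long long wide[2];
    _mm_storeu_si128((__m128i*)wide, _mm_mul_epi32(a, b));
    printf("mul:   %lld %lld\n", wide[0], wide[1]);
    /* mul:   10 90 -- only elements 0 and 2, widened to 64 bits */
    return 0;
}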

doqtor
  • Looks like a good solution, but could use a little commentary to explain what needed fixing to make it a really useful answer. – Paul R Oct 30 '18 at 14:32
  • @PaulR added, hope that's sufficient :) – doqtor Oct 30 '18 at 15:07
  • Did you try this? It doesn’t generate anything like the correct answer. I think the reason is that you can’t use &mat2[j][k] because you need the columnar vector of the matrix { (j,k), (j+1,k), (j+2, k), (j+3, k) }. – mevets Oct 30 '18 at 15:40
  • @mevets yes I did, and it gives exactly the same results as `matrix_mul`, which the OP wanted to vectorize. – doqtor Oct 30 '18 at 15:53
  • Let me be more precise... the output of the fixed vectorized implementation in my answer is exactly the same as the OP's `matrix_mul` output. – doqtor Oct 30 '18 at 16:08
  • Thank you so much, this explanation was so helpful :) @doqtor – Majd.A.A Oct 30 '18 at 18:08
  • Your version assumes N is a multiple of 4, but it looks like @Majd.A.A was doing the same thing and hadn't added code to handle the last few elements yet. And it assumes `result` is already zeroed, just like `matrix_mul`. – Peter Cordes Oct 31 '18 at 03:56
0

I changed your code around a bit to make the addressing explicit [it helps in this case].

#define N 100

This is a stub for the vector unit's multiply & accumulate operation; you should be able to replace NV with whatever width your vector unit has, and put the relevant opcodes in here.

#define NV 8
int Vmacc(int *A, int *B) {
    int i = 0;
    int x = 0;
    for (i = 0; i < NV; i++) {
        x += *A++ * *B++;
    }
    return x;
}
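For example, with SSE4.1 and NV = 8, the stub could be filled in along these lines as a drop-in replacement (a sketch only, not tuned):

#include <smmintrin.h> /* SSE4.1 for _mm_mullo_epi32 */

int Vmacc(int *A, int *B) {
    __m128i sum = _mm_setzero_si128();
    for (int i = 0; i < NV; i += 4) {
        __m128i a = _mm_loadu_si128((const __m128i*)(A + i));
        __m128i b = _mm_loadu_si128((const __m128i*)(B + i));
        sum = _mm_add_epi32(sum, _mm_mullo_epi32(a, b));
    }
    /* horizontal sum of the four 32-bit lanes */
    sum = _mm_add_epi32(sum, _mm_srli_si128(sum, 8));
    sum = _mm_add_epi32(sum, _mm_srli_si128(sum, 4));
    return _mm_cvtsi128_si32(sum);
}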

This multiply has two notable variations from the norm: 1. it caches the columnar vector into a contiguous one; 2. it attempts to push slices of the multiply-accumulate into a vector-like function. Even without using the vector unit, this takes half the time of the naive version, just because of better cache/prefetch utilization.

void mm2(int *A, int *B, int n, int *C) {
    int c, r;
    int stride = 0;
    int cache[N];
    for (c = 0; c < n; c++) {
        /* cache column c: */
        for (r = 0; r < n; r++) {
            cache[r] = B[c + r*n];
        }
        for (r = 0; r < n; r++) {
            int k = 0;
            int x = 0;
            int *Av = A + r*n;
            for (k = 0; k+NV-1 < n; k += NV) {
                x += Vmacc(Av+k, cache+k);
            }
            while (k < n) {
                x += Av[k] * cache[k];
                k++;
            }
            C[r*n + c] = x;
        }
    }
}
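A hypothetical caller, with the matrices flattened to pointers:

int A[N][N], B[N][N], C[N][N];
/* ... fill A and B ... */
mm2(&A[0][0], &B[0][0], N, &C[0][0]);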
mevets
  • `Vmacc` is not particularly SIMD-friendly. You want to accumulate a SIMD *vector* of sums and horizontal sum at the end, not horizontal-sum inside the inner-most loop. The OP is already working around the transpose / strided-access problem by adding into `result[i][k]` for a whole vector of elements in parallel, instead of doing one column * row dot-product at a time. That approach has its limitations, too, but with the limited throughput of SIMD integer multiply (vs. SIMD FP FMA) it should be fine. Some cache-blocking could help, though. – Peter Cordes Oct 31 '18 at 04:03
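What the comment suggests, as a rough sketch (one reading of it, using SSE4.1): keep a vector of partial sums across the whole inner loop and reduce it only once per output element, which would replace both the Vmacc calls and the scalar tail loop in mm2:

#include <smmintrin.h> /* SSE4.1 */

/* Sketch: one vector accumulator per dot product,
   horizontally summed only once at the end. */
static int dot_sse(const int *a, const int *b, int n)
{
    __m128i acc = _mm_setzero_si128();
    int k = 0;
    for (; k + 3 < n; k += 4) {
        __m128i va = _mm_loadu_si128((const __m128i*)(a + k));
        __m128i vb = _mm_loadu_si128((const __m128i*)(b + k));
        acc = _mm_add_epi32(acc, _mm_mullo_epi32(va, vb));
    }
    /* one horizontal sum at the end, not per chunk */
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 8));
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 4));
    int x = _mm_cvtsi128_si32(acc);
    for (; k < n; k++) /* scalar tail for n not a multiple of 4 */
        x += a[k] * b[k];
    return x;
}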