3

I'm having trouble doing matrix-matrix multiplication with SSE in C.

Here is what I got so far:

#include <immintrin.h>  // SSE2/SSSE3/SSE4.1 intrinsics used below

#define N 1000

void matmulSSE(int mat1[N][N], int mat2[N][N], int result[N][N]) {
  int i, j, k;
  __m128i vA, vB, vR;

  for(i = 0; i < N; ++i) {
    for(j = 0; j < N; ++j) {
        vR = _mm_setzero_si128();
        for(k = 0; k < N; k += 4) {
            //result[i][j] += mat1[i][k] * mat2[k][j];
            vA = _mm_loadu_si128((__m128i*)&mat1[i][k]);
            vB = _mm_loadu_si128((__m128i*)&mat2[k][j]); //how well does the k += 4 work here? Should it be unrolled?
            vR = _mm_add_epi32(vR, _mm_mul_epi32(vA, vB));
        }
        vR = _mm_hadd_epi32(vR, vR);
        vR = _mm_hadd_epi32(vR, vR);
        result[i][j] += _mm_extract_epi32(vR, 0);
    }
  }
}

I can't seem to make it give the correct results. Am I missing something? And searching doesn't seem to help much - every result is either only doing 4x4 matrices, mat-vec, or some special magic that's not very readable and hard to understand...

S.S. Anne
Erlisch
  • What is the trouble you are having? – Rushy Panchal Oct 28 '16 at 21:30
  • @RushyPanchal I'm not getting the correct results. Sorry, I should have specified that in my post... – Erlisch Oct 28 '16 at 21:48
  • 1
    Does the caller zero `result[]` for you? If not, you should do that first! Also note that doing a horizontal sum inside the inner-most loop is horrible. If you do all the math for `result[i][j]` inside the same inner-most loop, just do `result = hsum(vR)`, not `+=`. Where hsum is a horizontal-sum function that's portable to non-MSVC (if that matters) and sucks less than what the compiler probably produces for what you wrote. See http://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-float-vector-sum-on-x86, where my answer mentions integer hsums. – Peter Cordes Oct 29 '16 at 06:45
  • Yeah, it's zeroed. And the += was mostly for basic testing. Updated with hadd - is it along the lines of what you were talking about? – Erlisch Oct 30 '16 at 01:46
  • Well, `_mm_hadd_epi32` is not the most efficient way to do a horizontal sum (read the link in my last comment). But yes, that's the right idea. Why is it commented out, and why are you storing 4 copies of the result instead of just a scalar store of the horizontal sum to `result[i][j]`? – Peter Cordes Oct 30 '16 at 02:35
  • Oops, I was doing some tests and commented it out - guess I copied it at a bad time. And I want to get it working properly first before looking at optimizing it even more. I guess I was hoping to calculate the result for 4 at once, but I see now that that doesn't work as well as I hoped it would. However, it is still not working properly... I'm heavily suspecting that the problem lies with how I'm using vB... – Erlisch Oct 30 '16 at 06:44
  • Oh, yeah, you're loading 4 consecutive integers, but `mat2[k+0..3][j]` aren't contiguous. Much has been written about optimizing matrix multiplies, with SIMD and with cache-blocking. I suggest you google it up. Most of it is probably talking about FP, but it all applies to integer as well. (Except that SSE/AVX only has FMA for FP, not for 32-bit integers, and the 8 and 16-bit input PMADD instructions do horizontal adds of pairs.) – Peter Cordes Oct 30 '16 at 06:51

2 Answers

3

You're right, your vB is the problem. You're loading 4 consecutive integers, but mat2[k+0..3][j] aren't contiguous. You're actually getting mat2[k][j+0..3].


I forget exactly what works best for matmul. Sometimes it works well to produce 4 results in parallel, instead of doing a horizontal sum for every result.

Transposing one of your input matrices works, and costs O(N^2). It's worth it because it means the O(N^3) matmul can use sequential accesses, and your current loop structure becomes SIMD-friendly.
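For example, here's a sketch of the transpose approach (my illustration, not benchmarked; the name matmulSSE_transposed and the scratch array mat2T are made up for this example, and it assumes N is a multiple of 4 and SSE4.1 is available):

void matmulSSE_transposed(int mat1[N][N], int mat2[N][N], int result[N][N]) {
    static int mat2T[N][N];                 // O(N^2) scratch for the transpose (not thread-safe; fine for a sketch)
    for (int r = 0; r < N; ++r)
        for (int c = 0; c < N; ++c)
            mat2T[c][r] = mat2[r][c];

    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            __m128i vsum = _mm_setzero_si128();
            for (int k = 0; k < N; k += 4) {                // both inputs are now contiguous
                __m128i vA = _mm_loadu_si128((const __m128i*)&mat1[i][k]);
                __m128i vB = _mm_loadu_si128((const __m128i*)&mat2T[j][k]);
                vsum = _mm_add_epi32(vsum, _mm_mullo_epi32(vA, vB));   // SSE4.1
            }
            vsum = _mm_hadd_epi32(vsum, vsum);              // one horizontal sum per result,
            vsum = _mm_hadd_epi32(vsum, vsum);              // outside the k loop (still not the fastest hsum)
            result[i][j] = _mm_cvtsi128_si32(vsum);
        }
    }
}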

There are even better ways, such as transposing small blocks right before use, so they're still hot in L1 cache when you read them again. Or looping over a destination row and adding in one result, instead of accumulating a full result for a single or small set of row*column dot products. Cache blocking, aka loop tiling, is one key to good matmul performance. See also What Every Programmer Should Know About Memory? which has a cache-blocked SIMD FP matmul example in an appendix without a transpose.
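As a rough sketch of what the loop-tiling structure looks like when combined with the broadcast approach shown further down (TILE is an arbitrary, untuned value I picked for illustration; this assumes result is zeroed by the caller and N is a multiple of both 4 and TILE, e.g. TILE = 40 or 100 when N = 1000):

#define TILE 100   // illustration only: untuned, must divide N and be a multiple of 4

// Cache-blocked i-k-j matmul sketch: accumulates into result[][], so zero it first.
void matmulSSE_tiled(int mat1[N][N], int mat2[N][N], int result[N][N]) {
    for (int ii = 0; ii < N; ii += TILE)
        for (int kk = 0; kk < N; kk += TILE)
            for (int jj = 0; jj < N; jj += TILE)
                for (int i = ii; i < ii + TILE; ++i)
                    for (int k = kk; k < kk + TILE; ++k) {
                        __m128i vA = _mm_set1_epi32(mat1[i][k]);   // broadcast one element of mat1
                        for (int j = jj; j < jj + TILE; j += 4) {
                            __m128i vB = _mm_loadu_si128((const __m128i*)&mat2[k][j]);
                            __m128i vR = _mm_loadu_si128((__m128i*)&result[i][j]);
                            vR = _mm_add_epi32(vR, _mm_mullo_epi32(vA, vB));   // SSE4.1
                            _mm_storeu_si128((__m128i*)&result[i][j], vR);
                        }
                    }
}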

Much has been written about optimizing matrix multiplies, with SIMD and with cache-blocking. I suggest you google it up. Most of it is probably talking about FP, but it all applies to integer as well.

(Except that SSE/AVX only has FMA for FP, not for 32-bit integers, and the 8 and 16-bit input PMADD instructions do horizontal adds of pairs.)


Actually, you can produce 4 results in parallel here without transposing anything, by broadcasting one element of mat1 and loading 4 contiguous elements from a row of mat2:

void matmulSSE(int mat1[N][N], int mat2[N][N], int result[N][N]) {

  for(int i = 0; i < N; ++i) {
    for(int j = 0; j < N; j+=4) {   // vectorize over this loop
        __m128i vR = _mm_setzero_si128();
        for(int k = 0; k < N; k++) {   // not this loop
            //result[i][j] += mat1[i][k] * mat2[k][j];
            __m128i vA = _mm_set1_epi32(mat1[i][k]);  // load+broadcast is much cheaper than MOVD + 3 inserts (or especially 4x insert, which your new code is doing)
            __m128i vB = _mm_loadu_si128((__m128i*)&mat2[k][j]);  // mat2[k][j+0..3]
            vR = _mm_add_epi32(vR, _mm_mullo_epi32(vA, vB));
        }
        _mm_storeu_si128((__m128i*)&result[i][j], vR);
    }
  }
}

A broadcast-load (or separate load+broadcast without AVX) is still much cheaper than a gather.

Your current code does the gather with 4 inserts, instead of breaking the dependency chain on the previous iteration's value by using a MOVD for the first element, so that's even worse. But even the best gather of 4 scattered elements is pretty bad compared to a load + PUNPCKLDQ. Not to mention that it makes your code need SSE4.1.
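For reference, a less-bad SSE2-only manual gather (still a gather, so still slow; just a sketch of the MOVD + PUNPCK idea rather than anything from the original code) would look something like:

__m128i v0 = _mm_cvtsi32_si128(mat2[k+0][j]);   // MOVD load: no false dependency on an old vB
__m128i v1 = _mm_cvtsi32_si128(mat2[k+1][j]);
__m128i v2 = _mm_cvtsi32_si128(mat2[k+2][j]);
__m128i v3 = _mm_cvtsi32_si128(mat2[k+3][j]);
__m128i v01 = _mm_unpacklo_epi32(v0, v1);       // { v0, v1, 0, 0 }   PUNPCKLDQ
__m128i v23 = _mm_unpacklo_epi32(v2, v3);       // { v2, v3, 0, 0 }
__m128i vB  = _mm_unpacklo_epi64(v01, v23);     // { v0, v1, v2, v3 } PUNPCKLQDQ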

Although it needs SSE4.1 anyway for _mm_mullo_epi32 instead of the widening PMULDQ (_mm_mul_epi32).
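(If SSE4.1 can't be assumed at all, the low-32-bit multiply can be emulated with plain SSE2. This is a sketch of the standard trick, added here for illustration; mullo_epi32_sse2 is a made-up name, not a real intrinsic:)

// SSE2-only replacement for _mm_mullo_epi32: the low 32 bits of a product
// don't depend on signedness, so PMULUDQ works for signed inputs too.
static inline __m128i mullo_epi32_sse2(__m128i a, __m128i b) {
    __m128i even = _mm_mul_epu32(a, b);                       // 64-bit products of elements 0 and 2
    __m128i odd  = _mm_mul_epu32(_mm_srli_si128(a, 4),
                                 _mm_srli_si128(b, 4));       // 64-bit products of elements 1 and 3
    even = _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0));  // pack the low halves into dwords 0,1
    odd  = _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0));
    return _mm_unpacklo_epi32(even, odd);                     // interleave back into original element order
}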

Note that integer multiply throughput is generally worse than FP multiply, especially on Haswell and later. FP FMA units only have 24-bit wide multipliers per 32-bit element (for FP mantissas) so using those for 32x32=>32-bit integer requires splitting into two uops.

Peter Cordes
  • Yes, this was a big error. Along with this, I also found that the multiplication in SSE didn't work as I thought it did - see my edit. Thanks for all the help. :) – Erlisch Oct 30 '16 at 14:07
  • @Erlisch: holy crap, now you have 2 PHADDD instructions inside the inner loop, and a scalar `+=`. Did you benchmark that against scalar code? It's probably slower. – Peter Cordes Oct 31 '16 at 01:34
  • @Erlisch: I just noticed that we can produce results for `j+0..3` in parallel, and that's a LOT more efficient than vectorizing over `k` with a gather. Updated my answer. – Peter Cordes Oct 31 '16 at 01:47
  • Yes, it was a lot slower, but as I said I was making a working example first. Your version cut the time by more than half. It also looks like it could easily be changed to i-k-j loop order to get the vectorized loop innermost. – Erlisch Oct 31 '16 at 11:33
  • @Erlisch: Yes, but then instead of accumulating a result in a register, you'd have to load/add/store to `&result[i][j]` inside the inner loop. It's probably better to simply produce 4 results in parallel from the inner loop. You could maybe improve your cache utilization by doing 4 vectors of `j` values in parallel, to get more reuse of the broadcast-load, and so the inner loop uses all 64B of every cache line it touches in mat2 (assuming mat2 is 64B-aligned). The current version uses only 16B of every cache line, with a strided access pattern. – Peter Cordes Oct 31 '16 at 11:49
  • I'm having deja-vu about doing 4 vectors in parallel to get more reuse of the broadcast-load; I'm pretty sure I've come up with this idea before for the same problem, or at least something extremely similar. – Peter Cordes Oct 31 '16 at 11:50
0

This first version was posted by the OP as an edit to the question where it doesn't belong. Moved it to a community-wiki answer just for posterity.

That first version is absolute garbage for performance, the worst possible way to vectorize: doing an hsum down to scalar inside the inner-most loop, and doing a manual gather with insert_epi32, not even a 4x4 transpose.


Update: Woho! I finally figured it out. Besides the errors in my logic (thanks for the help, Peter Cordes) there was also the issue of _mm_mul_epi32() not working as I thought it did - I should've been using _mm_mullo_epi32() instead!
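(Editor's note: to make the difference concrete, _mm_mul_epi32 is PMULDQ, a widening multiply that only reads elements 0 and 2 of each input, while _mm_mullo_epi32 is PMULLD, which keeps the low 32 bits of all four products. A quick illustration with made-up values:)

__m128i a = _mm_setr_epi32(1, 2, 3, 4);
__m128i b = _mm_setr_epi32(10, 20, 30, 40);

__m128i lo = _mm_mullo_epi32(a, b);  // { 10, 40, 90, 160 }: four 32-bit products (PMULLD, SSE4.1)
__m128i wd = _mm_mul_epi32(a, b);    // { 10, 90 } as two 64-bit results: only elements 0 and 2
                                     // of each input are used (1*10 and 3*30, widened to 64 bits)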

I know this is not the most effective code, but it was made to get it to work properly first - now I can move on to optimizing it.

(Note, don't use this, it's very very slow)

// editor's note: this is the most naive and horrible way to vectorize
void matmulSSE_inefficient(int mat1[N][N], int mat2[N][N], int result[N][N]) {
    int i, j, k;
    __m128i vA, vB, vR, vSum;

    for(i = 0; i < N; ++i) {
        for(j = 0; j < N; ++j) {
            vR = _mm_setzero_si128();
            for(k = 0; k < N; k += 4) {
                //result[i][j] += mat1[i][k] * mat2[k][j];
                vA = _mm_loadu_si128((__m128i*)&mat1[i][k]);
                          // less braindead would be to start vB with movd, avoiding a false dep and one shuffle uop.
                          // vB = _mm_cvtsi32_si128(mat2[k][j]);   // but this manual gather is still very bad
                vB = _mm_insert_epi32(vB, mat2[k][j], 0);     // false dependency on old vB
                vB = _mm_insert_epi32(vB, mat2[k + 1][j], 1);  // bad spatial locality
                vB = _mm_insert_epi32(vB, mat2[k + 2][j], 2);  // striding down a column
                vB = _mm_insert_epi32(vB, mat2[k + 3][j], 3);
                vR = _mm_mullo_epi32(vA, vB);
                vR = _mm_hadd_epi32(vR, vR);                // very slow inside the inner loop
                vR = _mm_hadd_epi32(vR, vR);
                result[i][j] += _mm_extract_epi32(vR, 0);

                //DEBUG
                //printf("vA: %d, %d, %d, %d\n", vA.m128i_i32[0], vA.m128i_i32[1], vA.m128i_i32[2], vA.m128i_i32[3]);
                //printf("vB: %d, %d, %d, %d\n", vB.m128i_i32[0], vB.m128i_i32[1], vB.m128i_i32[2], vB.m128i_i32[3]);
                //printf("vR: %d, %d, %d, %d\n", vR.m128i_i32[0], vR.m128i_i32[1], vR.m128i_i32[2], vR.m128i_i32[3]);
                //printf("\n");
            }
        }
    }
}

End of extremely inefficient code originally written by the OP


Update 2: converted the OP's example to an i-k-j loop order version. This required an extra load of vR and moving the store into the inner loop, but the vA setup could be hoisted up one loop level. It turned out faster.

// this is significantly better but doesn't do any cache-blocking
// note: this accumulates into result[][], so the caller must zero it first
void matmulSSE_v2(int mat1[N][N], int mat2[N][N], int result[N][N]) {
    int i, j, k;
    __m128i vA, vB, vR;

    for(i = 0; i < N; ++i) {
        for(k = 0; k < N; ++k) {
            vA = _mm_set1_epi32(mat1[i][k]);
            for(j = 0; j < N; j += 4) {
                //result[i][j] += mat1[i][k] * mat2[k][j];
                vB = _mm_loadu_si128((__m128i*)&mat2[k][j]);
                vR = _mm_loadu_si128((__m128i*)&result[i][j]);
                vR = _mm_add_epi32(vR, _mm_mullo_epi32(vA, vB));
                _mm_storeu_si128((__m128i*)&result[i][j], vR);

                //DEBUG
                //printf("vA: %d, %d, %d, %d\n", vA.m128i_i32[0], vA.m128i_i32[1], vA.m128i_i32[2], vA.m128i_i32[3]);
                //printf("vB: %d, %d, %d, %d\n", vB.m128i_i32[0], vB.m128i_i32[1], vB.m128i_i32[2], vB.m128i_i32[3]);
                //printf("vR: %d, %d, %d, %d\n", vR.m128i_i32[0], vR.m128i_i32[1], vR.m128i_i32[2], vR.m128i_i32[3]);

                //printf("\n");
            }
        }
    }
}

These assume N is a multiple of 4, the vector width

If that's not the case, often it's easier to still pad your array storage to a multiple of the vector width, so there's padding at the end of each row and you can just use that simple j < N; j += 4 loop condition. You'll want to keep track of the real N size separately from the storage layout with a row stride that's a multiple of 4 or 8.

Otherwise you want a loop condition like `j < N-3; j += 4`, and a scalar cleanup for the end of a row.
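For example, the inner j loop of matmulSSE_v2 with a scalar tail might look like this sketch (same surrounding i and k loops and the same vA broadcast as above; not benchmarked):

int j;
for (j = 0; j < N - 3; j += 4) {                // vector part: full groups of 4 columns
    __m128i vB = _mm_loadu_si128((const __m128i*)&mat2[k][j]);
    __m128i vR = _mm_loadu_si128((__m128i*)&result[i][j]);
    vR = _mm_add_epi32(vR, _mm_mullo_epi32(vA, vB));
    _mm_storeu_si128((__m128i*)&result[i][j], vR);
}
for (; j < N; ++j)                              // scalar cleanup for the last N % 4 columns
    result[i][j] += mat1[i][k] * mat2[k][j];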

Or use masking, or keep the last full vector in a register so you can _mm_alignr_epi8 to create a possibly-overlapping final vector that ends at the end of the row, and maybe do a vector store. This is easier with AVX, or especially with AVX-512 masking.

Peter Cordes
  • Your code is correct but will only work for dimensions which are a multiple of 4. Can the code be made to work for any dimensions? – rojan sudev Apr 06 '20 at 08:34
  • @rojansudev: yes, you can do a scalar cleanup for the last `N & 3` elements of a row, if it's not a multiple of 4. Or better, pad your rows. Updated the answer with some hints. – Peter Cordes Apr 06 '20 at 08:54