3

I'm having trouble doing matrix-matrix multiplication with SSE in C.

Here is what I got so far:

#include <immintrin.h>  // SSE2/SSSE3/SSE4.1 intrinsics used below

#define N 1000

void matmulSSE(int mat1[N][N], int mat2[N][N], int result[N][N]) {
  int i, j, k;
  __m128i vA, vB, vR;

  for(i = 0; i < N; ++i) {
    for(j = 0; j < N; ++j) {
        vR = _mm_setzero_si128();
        for(k = 0; k < N; k += 4) {
            //result[i][j] += mat1[i][k] * mat2[k][j];
            vA = _mm_loadu_si128((__m128i*)&mat1[i][k]);
            vB = _mm_loadu_si128((__m128i*)&mat2[k][j]); //how well does the k += 4 work here? Should it be unrolled?
            vR = _mm_add_epi32(vR, _mm_mul_epi32(vA, vB));
        }
        vR = _mm_hadd_epi32(vR, vR);
        vR = _mm_hadd_epi32(vR, vR);
        result[i][j] += _mm_extract_epi32(vR, 0);
    }
  }
}

I can't seem to make it give the correct results. Am I missing something? And searching doesn't seem to help much - every result is either only doing 4x4 matrices, mat-vec, or some special magic that's not very readable and hard to understand...

S.S. Anne
Erlisch
  • What is the trouble you are having? – Rushy Panchal Oct 28 '16 at 21:30
  • @RushyPanchal I'm not getting the correct results. Sorry, I should have specified that in my post... – Erlisch Oct 28 '16 at 21:48
  • 1
    Does the caller zero `result[]` for you? If not, you should do that first! Also note that doing a horizontal sum inside the inner-most loop is horrible. If you do all the math for `result[i][j]` inside the same inner-most loop, just do `result = hsum(vR)`, not `+=`. Where hsum is a horizontal-sum function that's portable to non-MSVC (if that matters) and sucks less than what the compiler probably produces for what you wrote. See http://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-float-vector-sum-on-x86, where my answer mentions integer hsums. – Peter Cordes Oct 29 '16 at 06:45
  • Yeah, it's zeroed. And the += was mostly for basic testing. Updated with hadd - is it along the lines of what you were talking about? – Erlisch Oct 30 '16 at 01:46
  • Well, `_mm_hadd_epi32` is not the most efficient way to do a horizontal sum (read the link in my last comment). But yes, that's the right idea. Why is it commented out, and why are you storing 4 copies of the result instead of just a scalar store of the horizontal sum to `result[i][j]`? – Peter Cordes Oct 30 '16 at 02:35
  • Oops, I was doing some tests and commented it out - guess I copied it at a bad time. And I want to get it working properly first before looking at optimizing it even more. I guess I was hoping to calculate the result for 4 at once, but I see now that that doesn't work as well as I hoped it would. However, it is still not working properly... I'm heavily suspecting that the problem lies with how I'm using vB... – Erlisch Oct 30 '16 at 06:44
  • Oh, yeah, you're loading 4 consecutive integers, but `mat2[k+0..3][j]` aren't contiguous. Much has been written about optimizing matrix multiplies, with SIMD and with cache-blocking. I suggest you google it up. Most of it is probably talking about FP, but it all applies to integer as well. (Except that SSE/AVX only has FMA for FP, not for 32-bit integers, and the 8 and 16-bit input PMADD instructions do horizontal adds of pairs.) – Peter Cordes Oct 30 '16 at 06:51

2 Answers

3

You're right, your vB is the problem. You're loading 4 consecutive integers, but mat2[k+0..3][j] aren't contiguous. You're actually getting mat2[k][j+0..3].


I forget exactly what works best for matmul. Sometimes it works well to produce 4 results in parallel, instead of doing a horizontal sum for every result.

Transposing one of your input matrices works, and costs O(N^2). It's worth it because it means the O(N^3) matmul can use sequential accesses, and your current loop structure becomes SIMD-friendly.
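For example, here's a sketch of the transpose approach (my illustration, not benchmarked; the name matmulSSE_transposed and the scratch array mat2T are made up for this example, and it assumes N is a multiple of 4 and SSE4.1 is available):

void matmulSSE_transposed(int mat1[N][N], int mat2[N][N], int result[N][N]) {
    static int mat2T[N][N];                 // O(N^2) scratch for the transpose (not thread-safe; fine for a sketch)
    for (int r = 0; r < N; ++r)
        for (int c = 0; c < N; ++c)
            mat2T[c][r] = mat2[r][c];

    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            __m128i vsum = _mm_setzero_si128();
            for (int k = 0; k < N; k += 4) {                // both inputs are now contiguous
                __m128i vA = _mm_loadu_si128((const __m128i*)&mat1[i][k]);
                __m128i vB = _mm_loadu_si128((const __m128i*)&mat2T[j][k]);
                vsum = _mm_add_epi32(vsum, _mm_mullo_epi32(vA, vB));   // SSE4.1
            }
            vsum = _mm_hadd_epi32(vsum, vsum);              // one horizontal sum per result,
            vsum = _mm_hadd_epi32(vsum, vsum);              // outside the k loop (still not the fastest hsum)
            result[i][j] = _mm_cvtsi128_si32(vsum);
        }
    }
}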

There are even better ways, such as transposing small blocks right before use, so they're still hot in L1 cache when you read them again. Or looping over a destination row and adding in one result, instead of accumulating a full result for a single or small set of row*column dot products. Cache blocking, aka loop tiling, is one key to good matmul performance. See also What Every Programmer Should Know About Memory? which has a cache-blocked SIMD FP matmul example in an appendix without a transpose.
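As a rough sketch of what the loop-tiling structure looks like when combined with the broadcast approach shown further down (TILE is an arbitrary, untuned value I picked for illustration; this assumes result is zeroed by the caller and N is a multiple of both 4 and TILE, e.g. TILE = 40 or 100 when N = 1000):

#define TILE 100   // illustration only: untuned, must divide N and be a multiple of 4

// Cache-blocked i-k-j matmul sketch: accumulates into result[][], so zero it first.
void matmulSSE_tiled(int mat1[N][N], int mat2[N][N], int result[N][N]) {
    for (int ii = 0; ii < N; ii += TILE)
        for (int kk = 0; kk < N; kk += TILE)
            for (int jj = 0; jj < N; jj += TILE)
                for (int i = ii; i < ii + TILE; ++i)
                    for (int k = kk; k < kk + TILE; ++k) {
                        __m128i vA = _mm_set1_epi32(mat1[i][k]);   // broadcast one element of mat1
                        for (int j = jj; j < jj + TILE; j += 4) {
                            __m128i vB = _mm_loadu_si128((const __m128i*)&mat2[k][j]);
                            __m128i vR = _mm_loadu_si128((__m128i*)&result[i][j]);
                            vR = _mm_add_epi32(vR, _mm_mullo_epi32(vA, vB));   // SSE4.1
                            _mm_storeu_si128((__m128i*)&result[i][j], vR);
                        }
                    }
}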

Much has been written about optimizing matrix multiplies, with SIMD and with cache-blocking. I suggest you google it up. Most of it is probably talking about FP, but it all applies to integer as well.

(Except that SSE/AVX only has FMA for FP, not for 32-bit integers, and the 8 and 16-bit input PMADD instructions do horizontal adds of pairs.)


Actually, you can produce 4 results in parallel here without transposing anything, by broadcasting one element of mat1 and loading 4 contiguous elements from a row of mat2:

void matmulSSE(int mat1[N][N], int mat2[N][N], int result[N][N]) {

  for(int i = 0; i < N; ++i) {
    for(int j = 0; j < N; j+=4) {   // vectorize over this loop
        __m128i vR = _mm_setzero_si128();
        for(int k = 0; k < N; k++) {   // not this loop
            //result[i][j] += mat1[i][k] * mat2[k][j];
            __m128i vA = _mm_set1_epi32(mat1[i][k]);  // load+broadcast is much cheaper than MOVD + 3 inserts (or especially 4x insert, which your new code is doing)
            __m128i vB = _mm_loadu_si128((__m128i*)&mat2[k][j]);  // mat2[k][j+0..3]
            vR = _mm_add_epi32(vR, _mm_mullo_epi32(vA, vB));
        }
        _mm_storeu_si128((__m128i*)&result[i][j], vR);
    }
  }
}

A broadcast-load (or separate load+broadcast without AVX) is still much cheaper than a gather.

Your current code does the gather with 4 inserts, instead of breaking the dependency chain on the previous iteration's value by using a MOVD for the first element, so that's even worse. But even the best gather of 4 scattered elements is pretty bad compared to a load + PUNPCKLDQ. Not to mention that it makes your code need SSE4.1.
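For reference, a less-bad SSE2-only manual gather (still a gather, so still slow; just a sketch of the MOVD + PUNPCK idea rather than anything from the original code) would look something like:

__m128i v0 = _mm_cvtsi32_si128(mat2[k+0][j]);   // MOVD load: no false dependency on an old vB
__m128i v1 = _mm_cvtsi32_si128(mat2[k+1][j]);
__m128i v2 = _mm_cvtsi32_si128(mat2[k+2][j]);
__m128i v3 = _mm_cvtsi32_si128(mat2[k+3][j]);
__m128i v01 = _mm_unpacklo_epi32(v0, v1);       // { v0, v1, 0, 0 }   PUNPCKLDQ
__m128i v23 = _mm_unpacklo_epi32(v2, v3);       // { v2, v3, 0, 0 }
__m128i vB  = _mm_unpacklo_epi64(v01, v23);     // { v0, v1, v2, v3 } PUNPCKLQDQ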

Although it needs SSE4.1 anyway for _mm_mullo_epi32 instead of the widening PMULDQ (_mm_mul_epi32).
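(If SSE4.1 can't be assumed at all, the low-32-bit multiply can be emulated with plain SSE2. This is a sketch of the standard trick, added here for illustration; mullo_epi32_sse2 is a made-up name, not a real intrinsic:)

// SSE2-only replacement for _mm_mullo_epi32: the low 32 bits of a product
// don't depend on signedness, so PMULUDQ works for signed inputs too.
static inline __m128i mullo_epi32_sse2(__m128i a, __m128i b) {
    __m128i even = _mm_mul_epu32(a, b);                       // 64-bit products of elements 0 and 2
    __m128i odd  = _mm_mul_epu32(_mm_srli_si128(a, 4),
                                 _mm_srli_si128(b, 4));       // 64-bit products of elements 1 and 3
    even = _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0));  // pack the low halves into dwords 0,1
    odd  = _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0));
    return _mm_unpacklo_epi32(even, odd);                     // interleave back into original element order
}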

Note that integer multiply throughput is generally worse than FP multiply, especially on Haswell and later. FP FMA units only have 24-bit wide multipliers per 32-bit element (for FP mantissas) so using those for 32x32=>32-bit integer requires splitting into two uops.

Peter Cordes
  • Yes, this was a big error. Along with this, I also found that the multiplication in SSE didn't work as I thought it did - see my edit. Thanks for all the help. :) – Erlisch Oct 30 '16 at 14:07
  • @Erlisch: holy crap, now you have 2 PHADDD instructions inside the inner loop, and a scalar `+=`. Did you benchmark that against scalar code? It's probably slower. – Peter Cordes Oct 31 '16 at 01:34
  • @Erlisch: I just noticed that we can produce results for `j+0..3` in parallel, and that's a LOT more efficient than vectorizing over `k` with a gather. Updated my answer. – Peter Cordes Oct 31 '16 at 01:47
  • Yes, it was a lot slower, but as I said I was making a working example first. Your version cut the time by more than half. It also looks like it could easily be changed to i-k-j loop order to get the vectorized loop innermost. – Erlisch Oct 31 '16 at 11:33
  • @Erlisch: Yes, but then instead of accumulating a result in a register, you'd have to load/add/store to `&result[i][j]` inside the inner loop. It's probably better to simply produce 4 results in parallel from the inner loop. You could maybe improve your cache utilization by doing 4 vectors of `j` values in parallel, to get more reuse of the broadcast-load, and so the inner loop uses all 64B of every cache line it touches in mat2 (assuming mat2 is 64B-aligned). The current version uses only 16B of every cache line, with a strided access pattern. – Peter Cordes Oct 31 '16 at 11:49
  • I'm having deja-vu about doing 4 vectors in parallel to get more reuse of the broadcast-load; I'm pretty sure I've come up with this idea before for the same problem, or at least something extremely similar. – Peter Cordes Oct 31 '16 at 11:50
0

This first version was posted by the OP as an edit to the question where it doesn't belong. Moved it to a community-wiki answer just for posterity.

That first version is absolute garbage for performance, the worst possible way to vectorize: doing an hsum down to scalar inside the inner-most loop, and doing a manual gather with insert_epi32, not even a 4x4 transpose.


Update: Woho! I finally figured it out. Besides the errors in my logic (thanks for the help, Peter Cordes) there was also the issue of _mm_mul_epi32() not working as I thought it did - I should've been using _mm_mullo_epi32() instead!
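(Editor's note: to make the difference concrete, _mm_mul_epi32 is PMULDQ, a widening multiply that only reads elements 0 and 2 of each input, while _mm_mullo_epi32 is PMULLD, which keeps the low 32 bits of all four products. A quick illustration with made-up values:)

__m128i a = _mm_setr_epi32(1, 2, 3, 4);
__m128i b = _mm_setr_epi32(10, 20, 30, 40);

__m128i lo = _mm_mullo_epi32(a, b);  // { 10, 40, 90, 160 }: four 32-bit products (PMULLD, SSE4.1)
__m128i wd = _mm_mul_epi32(a, b);    // { 10, 90 } as two 64-bit results: only elements 0 and 2
                                     // of each input are used (1*10 and 3*30, widened to 64 bits)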

I know this is not the most effective code, but it was made to get it to work properly first - now I can move on to optimizing it.

(Note, don't use this, it's very very slow)

// editor's note: this is the most naive and horrible way to vectorize
void matmulSSE_inefficient(int mat1[N][N], int mat2[N][N], int result[N][N]) {
    int i, j, k;
    __m128i vA, vB, vR, vSum;

    for(i = 0; i < N; ++i) {
        for(j = 0; j < N; ++j) {
            vR = _mm_setzero_si128();
            for(k = 0; k < N; k += 4) {
                //result[i][j] += mat1[i][k] * mat2[k][j];
                vA = _mm_loadu_si128((__m128i*)&mat1[i][k]);
                          // less braindead would be to start vB with movd, avoiding a false dep and one shuffle uop.
                          // vB = _mm_cvtsi32_si128(mat2[k][j]);   // but this manual gather is still very bad
                vB = _mm_insert_epi32(vB, mat2[k][j], 0);     // false dependency on old vB
                vB = _mm_insert_epi32(vB, mat2[k + 1][j], 1);  // bad spatial locality
                vB = _mm_insert_epi32(vB, mat2[k + 2][j], 2);  // striding down a column
                vB = _mm_insert_epi32(vB, mat2[k + 3][j], 3);
                vR = _mm_mullo_epi32(vA, vB);
                vR = _mm_hadd_epi32(vR, vR);                // very slow inside the inner loop
                vR = _mm_hadd_epi32(vR, vR);
                result[i][j] += _mm_extract_epi32(vR, 0);

                //DEBUG
                //printf("vA: %d, %d, %d, %d\n", vA.m128i_i32[0], vA.m128i_i32[1], vA.m128i_i32[2], vA.m128i_i32[3]);
                //printf("vB: %d, %d, %d, %d\n", vB.m128i_i32[0], vB.m128i_i32[1], vB.m128i_i32[2], vB.m128i_i32[3]);
                //printf("vR: %d, %d, %d, %d\n", vR.m128i_i32[0], vR.m128i_i32[1], vR.m128i_i32[2], vR.m128i_i32[3]);
                //printf("\n");
            }
        }
    }
}

End of extremely inefficient code originally written by the OP


Update 2: converted the OP's example to an i-k-j loop order version. This required an extra load of vR and moving the store into the inner loop, but the vA setup could be hoisted up one loop level. It turned out faster.

// this is significantly better but doesn't do any cache-blocking
// note: this accumulates into result[][], so the caller must zero it first
void matmulSSE_v2(int mat1[N][N], int mat2[N][N], int result[N][N]) {
    int i, j, k;
    __m128i vA, vB, vR;

    for(i = 0; i < N; ++i) {
        for(k = 0; k < N; ++k) {
            vA = _mm_set1_epi32(mat1[i][k]);
            for(j = 0; j < N; j += 4) {
                //result[i][j] += mat1[i][k] * mat2[k][j];
                vB = _mm_loadu_si128((__m128i*)&mat2[k][j]);
                vR = _mm_loadu_si128((__m128i*)&result[i][j]);
                vR = _mm_add_epi32(vR, _mm_mullo_epi32(vA, vB));
                _mm_storeu_si128((__m128i*)&result[i][j], vR);

                //DEBUG
                //printf("vA: %d, %d, %d, %d\n", vA.m128i_i32[0], vA.m128i_i32[1], vA.m128i_i32[2], vA.m128i_i32[3]);
                //printf("vB: %d, %d, %d, %d\n", vB.m128i_i32[0], vB.m128i_i32[1], vB.m128i_i32[2], vB.m128i_i32[3]);
                //printf("vR: %d, %d, %d, %d\n", vR.m128i_i32[0], vR.m128i_i32[1], vR.m128i_i32[2], vR.m128i_i32[3]);

                //printf("\n");
            }
        }
    }
}

These assume N is a multiple of 4, the vector width

If that's not the case, often it's easier to still pad your array storage to a multiple of the vector width, so there's padding at the end of each row and you can just use that simple j < N; j += 4 loop condition. You'll want to keep track of the real N size separately from the storage layout with a row stride that's a multiple of 4 or 8.

Otherwise you want a loop condition like `j < N-3; j += 4`, and a scalar cleanup for the end of a row.
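For example, the inner j loop of matmulSSE_v2 with a scalar tail might look like this sketch (same surrounding i and k loops and the same vA broadcast as above; not benchmarked):

int j;
for (j = 0; j < N - 3; j += 4) {                // vector part: full groups of 4 columns
    __m128i vB = _mm_loadu_si128((const __m128i*)&mat2[k][j]);
    __m128i vR = _mm_loadu_si128((__m128i*)&result[i][j]);
    vR = _mm_add_epi32(vR, _mm_mullo_epi32(vA, vB));
    _mm_storeu_si128((__m128i*)&result[i][j], vR);
}
for (; j < N; ++j)                              // scalar cleanup for the last N % 4 columns
    result[i][j] += mat1[i][k] * mat2[k][j];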

Or use masking, or keep the last full vector in a register so you can _mm_alignr_epi8 to create a possibly-overlapping final vector that ends at the end of the row, and maybe do a vector store. This is easier with AVX, or especially with AVX-512 masking.

Peter Cordes
  • Your code is correct but will only work for dimensions which are a multiple of 4. Can the code be made to work for any dimensions? – rojan sudev Apr 06 '20 at 08:34
  • @rojansudev: yes, you can do a scalar cleanup for the last `N & 3` elements of a row, if it's not a multiple of 4. Or better, pad your rows. Updated the answer with some hints. – Peter Cordes Apr 06 '20 at 08:54