4

I want to multiply two matrices, one time by using the straight-forward-algorithm:

template <typename T>
void multiplicate_straight(T ** A, T ** B, T ** C, int sizeX)
{
    // C = A * B for square sizeX x sizeX matrices.
    // B is transposed into D first so the inner loop walks both operands
    // row-wise (contiguous memory), which is much more cache friendly.
    T ** D = AllocateDynamicArray2D<T>(sizeX, sizeX);
    transpose_matrix(B, D, sizeX);
    for(int i = 0; i < sizeX; i++)
    {
        for(int j = 0; j < sizeX; j++)
        {
            // Accumulate the dot product in a local and assign it at the
            // end: the old "C[i][j] += ..." silently required the caller
            // to pre-zero C, and was inconsistent with the SSE variant,
            // which assigns C[i][j].
            T sum = T();
            for(int g = 0; g < sizeX; g++)
            {
                sum += A[i][g] * D[j][g];
            }
            C[i][j] = sum;
        }
    }
    FreeDynamicArray2D<T>(D);
}

and one time via using SSE functions. For this I created two functions:

template <typename T>
void SSE_vectormult(T * A, T * B, int size)
{
    // Element-wise product A[i] *= B[i], two doubles per iteration via SSE2.
    // NOTE: the __m128d intrinsics are double-only, so this effectively
    // requires T == double; size must be even.
    //
    // The original staged every pair through small stack buffers before
    // loading them into SSE registers. That doubled the memory traffic,
    // and -- because a GCC alignment attribute placed after a declarator
    // list applies only to the last declarator -- fed _mm_load_pd
    // potentially unaligned addresses (undefined behavior). Loading
    // straight from A and B with the unaligned intrinsics removes both
    // problems and needs no platform-specific aligned declarations.
    for(int i = 0; i < size; i += 2)
    {
        __m128d a = _mm_loadu_pd(&A[i]);
        __m128d b = _mm_loadu_pd(&B[i]);
        __m128d c = _mm_mul_pd(a, b);
        // Result is written back into A, as before.
        _mm_storeu_pd(&A[i], c);
    }
}

and

template <typename T>
void multiplicate_SSE(T ** A, T ** B, T ** C, int sizeX)
{
    // C = A * B, computed one dot product at a time via SSE_vectormult().
    // B is transposed into D so that both operands are traversed row-wise.
    T ** D = AllocateDynamicArray2D<T>(sizeX, sizeX);
    T * tmp = AllocateDynamicArray1D<T>(sizeX);
    T * tmp2 = AllocateDynamicArray1D<T>(sizeX);
    transpose_matrix<T>(B, D, sizeX);
    for(int i = 0; i < sizeX; i++)
    {
        for(int j = 0; j < sizeX; j++)
        {
            // SSE_vectormult() overwrites tmp with the products, so the
            // row of A must be re-extracted on every inner iteration.
            extract_row<T>(A, tmp, i, sizeX);
            extract_row<T>(D, tmp2, j, sizeX);
            SSE_vectormult<T>(tmp, tmp2, sizeX);
            C[i][j] = add_vector(tmp, sizeX);
        }
    }
    FreeDynamicArray2D<T>(D);
    // tmp was allocated with AllocateDynamicArray1D just like tmp2, so it
    // must be released the same way -- leaving it out leaks sizeX * sizeof(T)
    // bytes per call. NOTE(review): the _BLOCK_TYPE_IS_VALID crash the asker
    // saw when freeing tmp suggests heap corruption elsewhere (e.g. the
    // unaligned _mm_load_pd in the original SSE_vectormult) -- verify with
    // the fixed SSE_vectormult.
    FreeDynamicArray1D<T>(tmp);
    FreeDynamicArray1D<T>(tmp2);
}

But then I get several problems: On the one hand, when I want to free the tmp-array in multiplicate_SSE(), marked with several question marks, I get the error "_BLOCK_TYPE_IS_VALID". I thought about the possibility of freeing the same space twice, so I commented this call out (but as I suppose, I get a memory leak through this?). Now, when I compare the performance of both functions with identical matrices, the SSE function needs around four times longer for two 1024x1024-matrices than the straight-forward-method.
How can I rewrite my SSE-Function to get a better performance (I have never worked with SSE before), and how can I fix my memory leak?
Thank you!

arc_lupus
  • 3,942
  • 5
  • 45
  • 81
  • 4
    Looks like you've fallen in to the usual trap of having too much data-movement and too little computation. You have one actual arithmetic instruction for 9+ memory accesses. That's a ratio of 1/9. If you want any sort of decent speedup, you need that ratio to be something like 2/1 at the least. – Mysticial Dec 06 '13 at 23:51
  • @Mysticial: Is there then the possibility to rewrite the function to get a speed up, or is my problem not suitable for this? – arc_lupus Dec 06 '13 at 23:52
  • 2
    Matrix multiply can be done in a way that has a high compute/memory-access ratio. But it's not easy. – Mysticial Dec 06 '13 at 23:54
  • One minor suggestion which applies to both methods is to skip the transpose, and sum over the first (instead of second) dimension of the second matrix. That is: `C[i][j] += A[i][g]*D[g][j];` – Matt Dec 06 '13 at 23:55
  • @Matt: But this would not reduce my data-movement overhead... – arc_lupus Dec 06 '13 at 23:56
  • If your data type is `double` then there is usually little to be gained from SSE on modern CPUs anyway, since there are typically two scalar FPUs and SSE is only 2-wide for doubles. AVX would still be worthwhile though. – Paul R Dec 07 '13 at 07:17
  • @PaulR, sandy bridge can do one SSE addition and one SSE multiplication at the same time so SSE with doubles should still be twice as fast as using x87 if done right. – Z boson Dec 07 '13 at 09:16
  • @arc_lupus, you don't want to take exactly the transpose when using SIMD. You need something different. You transpose strips with the strip width equal to the SIMD width. Look into strip mining. Better yet you want to do this in tiles rather than for the whole matrix. I can show you how to do this in a few days. – Z boson Dec 07 '13 at 09:21
  • @arc_lupus, I added some code using SSE which I think should be at least 4x as fast as your scalar code (assuming the matricies fit in the cache). – Z boson Dec 07 '13 at 18:05
  • WHY WHY WHY are you using templates here? If you want the same function to operate on matrices of both 32-bit and 64-bit floats, then I daresay you don't actually care about performance and are wasting your time with SSE. And if your type `T` as you call it is *really* secretly just a 64-bit float, then WHY THE TEMPLATE??? – Andrey Mishchenko Jan 06 '14 at 14:09

2 Answers2

4

You have the right idea by taking the transpose in the scalar code but you don't want exactly the transpose when using SSE.

Let's stick to float (SGEMM). What you want to do with SSE is do four dot products at once. You want C = A*B. Let's look at a 8x8 matrix. Let's assume B is:

(0   1  2  3) ( 4  5  6  7)
(8   9 10 11) (12 13 14 15) 
(16 17 18 19) (20 21 22 23)
(24 25 26 27) (28 29 30 31)
(32 33 34 35) (36 37 38 39)
(40 41 42 43) (44 45 46 47)
(48 49 50 51) (52 53 54 55)
(56 57 58 59) (60 61 62 63)

So with SSE you do:

C[0][0] C[0][1] C[0][2] C[0][3] = 
A[0][0]*(0 1 2 3) + A[0][1]*(8 9 10 11) + A[0][2]*(16 17 18 19)...+ A[0][7]*(56 57 58 59)

That gets you four dot products at once. The problem is that you have to move down a column in B and the values are not in the same cache line. It would be better if each column of width four was contiguous in memory. So instead of taking the transpose of each element you transpose strips with a width of 4 like this:

(0  1  2  3)( 8  9 10 11)(16 17 18 19)(24 25 26 27)(32 33 34 35)(40 41 42 43)(48 49 50 51)(56 57 58 59)
(4  5  6  7)(12 13 14 15)(20 21 22 23)(28 29 30 31)(36 37 38 39)(44 45 46 47)(52 53 54 55)(60 61 62 63)

If you think of each of the four values in parentheses as one unit this is equivalent to transposing a 8x2 matrix to a 2x8 matrix. Notice now that the columns of width four of B are contiguous in memory. This is far more cache friendly. For an 8x8 matrix this is not really an issue but for example with a 1024x1024 matrix it is. See the code below for how to do this. For AVX you transpose strips of width 8 (which means you have nothing to do for a 8x8 matrix). For double the width is two with SSE and four with AVX.

This should be four times faster than the scalar code assuming the matrices fit in the cache. However, for large matrices this method is still going to be memory bound and so your SSE code may not be much faster than scalar code (but it should not be worse).

However, if you use loop tiling — rearranging the matrix in tiles which fit in the L2 cache, rather than operating on the whole matrix at once — then matrix multiplication is computation bound and not memory bound, even for very large matrices that don't fit in the L3 cache. That's another topic.

Edit: some (untested) code to compare to your scalar code. I unrolled the loop by 2.

void SGEMM_SSE(const float *A, const float *B, float *C, const int sizeX) {
    // C = A * B for row-major sizeX x sizeX float matrices.
    // B is first rearranged into vertical strips of width 8 (see
    // transpose_matrix_strip) so every group of eight values read in the
    // inner loop is contiguous in memory. Eight dot products are computed
    // at once: SIMD width 4, unrolled by 2.
    // Requires: sizeX a multiple of 8; A, C (and D, guaranteed by
    // _mm_malloc) 16-byte aligned.
    const int simd_width = 4;
    const int unroll = 2;
    const int strip_width = simd_width*unroll;  // original was missing the ';'
    float *D = (float*)_mm_malloc(sizeof(float)*sizeX*sizeX, 16);
    transpose_matrix_strip(B, D, sizeX, strip_width); // transpose B in strips of width eight
    for(int i = 0; i < sizeX; i++) {
        for(int j = 0; j < sizeX; j += strip_width) {
            __m128 out_v1 = _mm_setzero_ps(); // four running dot products
            __m128 out_v2 = _mm_setzero_ps(); // ...and four more
            // now calculate eight dot products
            for(int g = 0; g < sizeX; g++) {
                // load eight values from D into two SSE registers; the
                // strip layout makes both loads contiguous and aligned
                __m128 vec4_1 = _mm_load_ps(&D[j*sizeX + strip_width*g]);
                __m128 vec4_2 = _mm_load_ps(&D[j*sizeX + strip_width*g + simd_width]);
                // A is a flat array here, so index it as A[i*sizeX + g];
                // broadcast that scalar across all four lanes
                __m128 a_bcast = _mm_set1_ps(A[i*sizeX + g]);
                out_v1 = _mm_add_ps(out_v1, _mm_mul_ps(a_bcast, vec4_1));
                out_v2 = _mm_add_ps(out_v2, _mm_mul_ps(a_bcast, vec4_2));
            }
            // store eight dot products into C
            _mm_store_ps(&C[i*sizeX + j], out_v1);
            _mm_store_ps(&C[i*sizeX + j + simd_width], out_v2);
        }
    }
    _mm_free(D);
}

void transpose_matrix_strip(const float* A, float* B, const int N, const int strip_width) {
    // "Strip transpose" of the row-major N x N matrix A into B: A's
    // columns are split into vertical strips of width strip_width, and
    // each strip is written to B row by row, so that strip_width
    // consecutive values from one source row stay contiguous in B.
    // Requires N to be a multiple of strip_width.
    for (int strip = 0; strip < N; strip += strip_width) {
        // Each strip occupies strip_width * N consecutive floats in B.
        float* out = B + strip * N;
        for (int row = 0; row < N; row++) {
            const float* src = A + N * row + strip;
            for (int col = 0; col < strip_width; col++) {
                *out++ = src[col];
            }
        }
    }
}

Notice that j increments by eight now. More unrolling may help. If you want to use intrinsics you can use _mm_load_ps, _mm_store_ps, _mm_set1_ps (for the broadcasts e.g. _mm_set1_ps(A[i][g])), _mm_add_ps, and _mm_mul_ps. That's it.

Z boson
  • 32,619
  • 11
  • 123
  • 226
1

I believe this should do the same thing as the first loop with SSE, assuming sizeX is a multiple of two and the memory is 16-byte aligned.

You may gain a bit more performance by unrolling the loop and using multiple temp variables which you add together at the end. You could also try AVX and the new Fused Multiply Add instruction.

template <typename T>
void multiplicate_SSE2(T ** A, T ** B, T ** C, int sizeX)
{
    // C = A * B using SSE2, accumulating each dot product two doubles at a
    // time. NOTE: the __m128d intrinsics are double-only, so this
    // effectively requires T == double; sizeX must be even and each row of
    // A and D must be 16-byte aligned (required by _mm_load_pd).
    T ** D = AllocateDynamicArray2D<T>(sizeX, sizeX);
    transpose_matrix(B, D, sizeX);
    for(int i = 0; i < sizeX; i++)
    {
        for(int j = 0; j < sizeX; j++)
        {
            __m128d temp = _mm_setzero_pd();
            for(int g = 0; g < sizeX; g += 2)
            {
                __m128d a = _mm_load_pd(&A[i][g]);
                __m128d b = _mm_load_pd(&D[j][g]);
                temp = _mm_add_pd(temp, _mm_mul_pd(a, b));
            }
            // Add the high and low halves of temp together
            temp = _mm_add_pd(temp, _mm_shuffle_pd(temp, temp, 1));
            // _mm_store_sd takes (destination, value); the original had the
            // arguments reversed, which does not compile.
            _mm_store_sd(&C[i][j], temp); // Store one value
        }
    }
    FreeDynamicArray2D<T>(D);
}
Community
  • 1
  • 1
Adam
  • 882
  • 5
  • 10