
I am trying to speed up matrix multiplication on a multicore architecture. To this end, I use threads and SIMD at the same time, but my results are not good. I measure the speedup over sequential matrix multiplication:

void sequentialMatMul(void* params)
{
    cout << "SequentialMatMul started.";
    int i, j, k;
    for (i = 0; i < N; i++)
    {
        for (k = 0; k < N; k++)
        {
            for (j = 0; j < N; j++)
            {
                X[i][j] += A[i][k] * B[k][j];
            }
        }
    }
    cout << "\nSequentialMatMul finished.";
}

I tried to add threading and SIMD to matrix multiplication as follows:

void threadedSIMDMatMul(void* params)
{
    bounds *args = (bounds*)params;
    int lowerBound = args->lowerBound;
    int upperBound = args->upperBound;
    int idx = args->idx;

    int i, j, k;
    for (i = lowerBound; i < upperBound; i++)
    {
        for (k = 0; k < N; k++)
        {
            for (j = 0; j < N; j+=4)
            {
                mmx1 = _mm_loadu_ps(&X[i][j]);
                mmx2 = _mm_load_ps1(&A[i][k]);
                mmx3 = _mm_loadu_ps(&B[k][j]);
                mmx4 = _mm_mul_ps(mmx2, mmx3);
                mmx0 = _mm_add_ps(mmx1, mmx4);
                _mm_storeu_ps(&X[i][j], mmx0);
            }
        }
    }
    _endthread();
}

The following section is used for calculating the lower and upper bounds for each thread:

bounds arg[CORES];
for (int part = 0; part < CORES; part++)
{
    arg[part].idx = part;
    arg[part].lowerBound = (N / CORES)*part;
    arg[part].upperBound = (N / CORES)*(part + 1);
}
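
The bounds type itself is not shown; judging from the fields used above, it is presumably just a small struct along these lines:

// Presumed definition of the bounds struct (inferred from the fields
// accessed in threadedSIMDMatMul; the real one may differ).
struct bounds
{
    int lowerBound; // first row this thread processes
    int upperBound; // one past the last row this thread processes
    int idx;        // thread index
};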

And finally the threaded SIMD version is called like this:

HANDLE  handle[CORES];      
for (int part = 0; part < CORES; part++)
{
    handle[part] = (HANDLE)_beginthread(threadedSIMDMatMul, 0, (void*)&arg[part]);
}
for (int part = 0; part < CORES; part++)
{
    WaitForSingleObject(handle[part], INFINITE);
}

The results are as follows:

Test 1:

// arrays are defined as follows
float A[N][N];
float B[N][N];
float X[N][N];
N=2048
Core=1//just one thread

Sequential time: 11129ms

Threaded SIMD matmul time: 14650ms

Speed up=0.75x

Test 2:

// arrays are defined as follows
float **A = (float**)_aligned_malloc(N * sizeof(float*), 16);
float **B = (float**)_aligned_malloc(N * sizeof(float*), 16);
float **X = (float**)_aligned_malloc(N * sizeof(float*), 16);
for (int k = 0; k < N; k++)
{
    A[k] = (float*)malloc(N * sizeof(float));
    B[k] = (float*)malloc(N * sizeof(float));
    X[k] = (float*)malloc(N * sizeof(float));
}
N=2048
Core=1//just one thread

Sequential time: 15907ms

Threaded SIMD matmul time: 18578ms

Speed up=0.85x

Test 3:

// arrays are defined as follows
float A[N][N];
float B[N][N];
float X[N][N];
N=2048
Core=2

Sequential time: 10855ms

Threaded SIMD matmul time: 27967ms

Speed up=0.38x

Test 4:

// arrays are defined as follows
float **A = (float**)_aligned_malloc(N * sizeof(float*), 16);
float **B = (float**)_aligned_malloc(N * sizeof(float*), 16);
float **X = (float**)_aligned_malloc(N * sizeof(float*), 16);
for (int k = 0; k < N; k++)
{
    A[k] = (float*)malloc(N * sizeof(float));
    B[k] = (float*)malloc(N * sizeof(float));
    X[k] = (float*)malloc(N * sizeof(float));
}
N=2048
Core=2

Sequential time: 16579ms

Threaded SIMD matmul time: 30160ms

Speed up=0.51x

My question: why don't I get a speedup?

estjop
  • I'm guessing that the naive implementation of matrix multiplication (which directly uses the definition of the product) does not translate too well to SIMD. In the first parallelization, the value `A[i][k]` needs to be copied into all four components of the register. The desired entry `X[i][j]` can be expressed as the sum of the products of the `i`-th row of `A` with the `j`-th column of `B`. If `B` were stored column-wise instead of row-wise, the parallelism could be applied in a more straightforward way and would also be more cache-friendly (see the sketch after these comments). – Codor Jun 15 '15 at 07:29
  • [openmp-c-matrix-multiplication-run-slower-in-parallel](https://stackoverflow.com/questions/22634121/openmp-c-matrix-multiplication-run-slower-in-parallel/22637933#22637933) – Z boson Jun 15 '15 at 08:09
  • When you define arrays rather than pointers to arrays, I assume you're using global arrays (if so, I would make them static), because the stack is too small for 2048x2048 arrays. – Z boson Jun 15 '15 at 14:15
  • @Codor Thank you for your comment; of course, transposing matrix B can improve the result. – estjop Jun 15 '15 at 15:39
  • @Zboson Yes, in my code the arrays are global. Thank you. – estjop Jun 15 '15 at 15:40
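
To make Codor's column-wise suggestion above concrete, here is a minimal sketch (not from the original post): it assumes N is a multiple of 4 and that a transposed copy Bt of B, with Bt[j][k] == B[k][j], has already been built; transposedMatMul is a made-up name.

// Sketch of the column-wise idea from the comment above (hypothetical code).
// Bt is an assumed transposed copy of B: Bt[j][k] == B[k][j].
void transposedMatMul()
{
    for (int i = 0; i < N; i++)
    {
        for (int j = 0; j < N; j++)
        {
            __m128 sum4 = _mm_setzero_ps();
            for (int k = 0; k < N; k += 4) // dot product of row i of A and column j of B
            {
                __m128 a4 = _mm_loadu_ps(&A[i][k]);
                __m128 b4 = _mm_loadu_ps(&Bt[j][k]);
                sum4 = _mm_add_ps(sum4, _mm_mul_ps(a4, b4));
            }
            float tmp[4];
            _mm_storeu_ps(tmp, sum4); // horizontal sum of the four partial products
            X[i][j] = tmp[0] + tmp[1] + tmp[2] + tmp[3];
        }
    }
}

Whether this beats the broadcast form in the question depends on caching and on the cost of the horizontal sum; it is only meant to illustrate the comment.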

2 Answers


Here are the times I get building on your algorithm on my four-core i7 IVB (Ivy Bridge) processor:

sequential:         3.42 s
4 threads:          0.97 s
4 threads + SSE:    0.86 s

Here are the times on a 2-core P9600 @ 2.53 GHz, which is similar to the OP's E2200 @ 2.2 GHz:

sequential:         6.52 s
2 threads:          3.66 s
2 threads + SSE:    3.75 s

I used OpenMP because it makes this easy. Each thread in OpenMP effectively runs over

lowerBound = N*part/CORES;
upperBound = N*(part + 1)/CORES;

(note that this is slightly different from your definition; yours can give the wrong result for some values of N due to rounding, since you divide by CORES first.)
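
For example, with N = 2048 and CORES = 3, integer division gives N/CORES = 682, so your last thread's upperBound would be 682*3 = 2046 and rows 2046 and 2047 would never be computed; with N*(part+1)/CORES the upper bounds come out to 682, 1365, and 2048, so every row is covered.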

As to the SIMD version: it's not much faster, probably because the operation is memory-bandwidth bound, and also because GCC already vectorizes the scalar loop anyway.

The optimal solution is much more complicated. You need to use loop tiling and reorder the elements within tiles to get the best performance. I don't have time to do that today.
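
Just to illustrate the blocking part of that idea, here is a minimal sketch (BS is an arbitrary, untuned tile size; n is assumed to be a multiple of BS; the element reordering within tiles is omitted):

void gemm_tiled(float * restrict a, float * restrict b, float * restrict c, int n) {
    const int BS = 64; // illustrative tile size, not tuned; assumes n % BS == 0
    for(int ii=0; ii<n; ii+=BS) {
        for(int kk=0; kk<n; kk+=BS) {
            for(int jj=0; jj<n; jj+=BS) {
                // multiply one BSxBS tile of a by one BSxBS tile of b
                for(int i=ii; i<ii+BS; i++) {
                    for(int k=kk; k<kk+BS; k++) {
                        for(int j=jj; j<jj+BS; j++) {
                            c[i*n+j] += a[i*n+k]*b[k*n+j];
                        }
                    }
                }
            }
        }
    }
}

The point is that each BS x BS tile of b and c stays in cache while it is reused; the fully reordered version exploits this further.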

Here is the code I used:

//c99 -O3 -fopenmp -Wall foo.c
#include <stdio.h>
#include <string.h>
#include <x86intrin.h>
#include <omp.h>

void gemm(float * restrict a, float * restrict b, float * restrict c, int n) {
    for(int i=0; i<n; i++) {
        for(int k=0; k<n; k++) {
            for(int j=0; j<n; j++) {
                c[i*n+j] += a[i*n+k]*b[k*n+j];
            }
        }
    }
}

void gemm_tlp(float * restrict a, float * restrict b, float * restrict c, int n) {
    #pragma omp parallel for
    for(int i=0; i<n; i++) {
        for(int k=0; k<n; k++) {
            for(int j=0; j<n; j++) {
                c[i*n+j] += a[i*n+k]*b[k*n+j];
            }
        }
    }
}   

void gemm_tlp_simd(float * restrict a, float * restrict b, float * restrict c, int n) {
    #pragma omp parallel for
    for(int i=0; i<n; i++) {
        for(int k=0; k<n; k++) {
            __m128 a4 = _mm_set1_ps(a[i*n+k]);
            for(int j=0; j<n; j+=4) {
                __m128 c4 = _mm_load_ps(&c[i*n+j]);
                __m128 b4 = _mm_load_ps(&b[k*n+j]);
                c4 = _mm_add_ps(_mm_mul_ps(a4,b4),c4);
                _mm_store_ps(&c[i*n+j], c4);
            }
        }
    }
}

int main(void) {
    int n = 2048;
    float *a = _mm_malloc(n*n * sizeof *a, 64);
    float *b = _mm_malloc(n*n * sizeof *b, 64);
    float *c1 = _mm_malloc(n*n * sizeof *c1, 64);
    float *c2 = _mm_malloc(n*n * sizeof *c2, 64);
    float *c3 = _mm_malloc(n*n * sizeof *c3, 64);
    for(int i=0; i<n*n; i++) a[i] = 1.0*i;
    for(int i=0; i<n*n; i++) b[i] = 1.0*i;
    memset(c1, 0, n*n * sizeof *c1);
    memset(c2, 0, n*n * sizeof *c2);
    memset(c3, 0, n*n * sizeof *c3);
    double dtime;

    dtime = -omp_get_wtime();
    gemm(a,b,c1,n);
    dtime += omp_get_wtime();
    printf("time %f\n", dtime);

    dtime = -omp_get_wtime();
    gemm_tlp(a,b,c2,n);
    dtime += omp_get_wtime();
    printf("time %f\n", dtime);

    dtime = -omp_get_wtime();
    gemm_tlp_simd(a,b,c3,n);
    dtime += omp_get_wtime();
    printf("time %f\n", dtime);
    printf("error %d\n", memcmp(c1,c2, n*n*sizeof *c1));
    printf("error %d\n", memcmp(c1,c3, n*n*sizeof *c1));
}
Z boson
  • @Zboson Thank you, Z boson. I tried your code in the Dev-C++ environment on Windows 7 with an Intel(R) Pentium(R) Dual CPU E2200 @ 2.20GHz and 3 GB of RAM. At first I compiled with -fopenmp -std=c99. The times I get are 134.909, 133.115, and 52.494 respectively for sequential, 2 threads, and 2 threads + SSE. Then I tried with -O3 -fopenmp -Wall -std=c99 and get the following times: 10.576, 9.657, and 9.15. The speedup of threaded+SSE over sequential in the last try on my computer is 1.09x, while your speedup is approx. 4x. Is this difference in speedup related to the CPU architecture? – estjop Jun 15 '15 at 15:35
  • @Zboson What I still don't understand is: what is the effect of the -O3 flag on the threaded+SSE version? – estjop Jun 15 '15 at 15:36
  • @estjop, it's pointless to measure performance without optimization enabled. I added times for a system similar to yours. I don't know why you're not seeing any benefit from threading. You should do a little test to make sure you're getting two threads with OpenMP (see the sketch after these comments). – Z boson Jun 15 '15 at 17:40
  • @Zboson, I tested the code and I am sure that I'm getting two threads, but it seems the speedup is not enough! This time my speedup was 1.34x. I have to test the code on a different computer. Could this problem be related to my compiler (Dev-C++, tested with both MinGW GCC 4.7.2 and TDM-GCC 4.9.2)? – estjop Jun 15 '15 at 18:37
  • @estjop, I don't know about your compiler. I tested this with GCC 4.9 with Linux Kernel 3.18 64-bit. Are you compiling in 64-bit mode? What happens if you add `-m64` to your options? – Z boson Jun 15 '15 at 18:48
  • @Zboson, my hardware is 32-bit. When I compile with '-m64', the compiler gives this message: 'sorry, unimplemented: 64-bit mode not compiled in'. – estjop Jun 16 '15 at 03:40
  • @estjop, you need to tell your compiler to use SSE instead of x87 for floating point. I would use `gcc -std=c99 -O3 -fopenmp -mfpmath=sse -msse2 -Wall foo.c`. The options `-mfpmath=sse -msse2` tell GCC to use SSE instead of x87 and to enable SSE2. – Z boson Jun 16 '15 at 06:59
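
A quick standalone way to do the thread-count check mentioned above (a minimal sketch, separate from the benchmark code):

// Standalone check that OpenMP actually launches more than one thread.
// Build with something like: gcc -fopenmp check_threads.c
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        #pragma omp single
        printf("OpenMP is using %d thread(s)\n", omp_get_num_threads());
    }
    return 0;
}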

It looks to me like the threads are sharing the __m128 mmx* variables; you probably defined them as global/static. You must be getting wrong results in your X array too. Define the __m128 mmx* variables inside the threadedSIMDMatMul function's scope and it will run much faster.

void threadedSIMDMatMul(void* params)
{
    __m128 mmx0, mmx1, mmx2, mmx3, mmx4;
    // rest of the code here
}
doqtor
  • Yes, they are global. But I've checked the results of the different versions; the results are the same as the sequential version's. – estjop Jun 15 '15 at 15:47
  • @estjop Really? I'm getting results similar to Z boson's - the multi-threaded version is faster than sequential, and SIMD in this form does not add any benefit. – doqtor Jun 15 '15 at 17:02
  • @estjop: The correctness of your results depends on the compiler optimizing away any stores/loads to the `mmx[0-8]` global variables. If you ever modify the code in a way that spills a register to memory, then your code is suddenly not thread-safe anymore. – Peter Cordes Jul 03 '15 at 23:43