
I'm new to OpenMP programming and I've run into a problem multiplying two matrices. This is my parallel code, but it is not as fast as I expected. For example, I give it two 3000 × 3000 matrices with a domain of 2 (the random values are 0 or 1), and the parallel version is slower than the sequential one.

    clock_t tStart = clock();
    cout << (char)169 << " parallel " << (char)170 << endl;
    int a, b, c, Domain;
    cin >> a >> b >> c >> Domain;
    srand(time(0));
    int **arr1;
    int **arr2;
    int **arrRet;

    arr1 = new int*[a];
    #pragma omp for schedule (dynamic)
    for(int i=0 ; i<a ; i++)
        arr1[i] = new int[b];

    arr2 = new int*[b];
    #pragma omp for schedule (dynamic)
    for(int i=0 ; i<b ; i++)
        arr2[i] = new int[c];

    arrRet = new int*[a];
    #pragma omp for schedule (dynamic)
    for(int i=0 ; i<a ; i++)
        arrRet[i] = new int[c];

    #pragma omp for schedule (dynamic)
    for(int i=0 ; i<a ; i++)
    {
        #pragma omp for schedule (dynamic)
        for(int j=0 ; j<b ; j++)
        {
            arr1[i][j] = rand() % Domain;
        }
    }

    //cout<<"\n\n\n";
    #pragma omp for schedule (dynamic)
    for(int i=0 ; i<b ; i++)
    {
        #pragma omp for schedule (dynamic)
        for(int j=0 ; j<c ; j++)
        {
            arr2[i][j] = rand() % Domain;
        }
    }

    //cout<<"\n\n\n";
    #pragma omp for schedule (dynamic)
    for(int i=0 ; i<a ; i++)
        #pragma omp for schedule (dynamic)
        for(int j2=0 ; j2<c ; j2++)
        {
            int sum = 0;
            #pragma omp parallel for shared(sum) reduction(+:sum)
            for(int j=0 ; j<b ; j++)
            {
                sum += arr1[i][j] * arr2[j][j2];
            }
            arrRet[i][j2] = sum;
        }
    printf("Time taken : %.4fs\n", (double)(clock() - tStart) / CLOCKS_PER_SEC);
ali
    Possible duplicate of [OpenMP time and clock() calculates two different results](https://stackoverflow.com/questions/10673732/openmp-time-and-clock-calculates-two-different-results) and https://stackoverflow.com/q/10624755/620382 – Zulan Aug 14 '17 at 07:51
  • 1
    Don't do matrix multiplication yourself. That's insane. Matrix multiplication is the oldest problem in the book. Get some library to do that for you, such as OpenBLAS. Also use [Armadillo](http://arma.sourceforge.net/) for holding your matrices. Stop reserving these lame arrays to hold matrices. They're slow because your compiler can't [vectorize](https://en.wikipedia.org/wiki/Automatic_vectorization) them. You can link Armadillo with OpenBLAS and it'll do the parallelization for you and use your processor's features to give you the best performance. – The Quantum Physicist Aug 14 '17 at 08:18
  • @TheQuantumPhysicist My teacher tells me I can't use a library :( – ali Aug 14 '17 at 08:26
  • 1
    OK. Then use `std::vector`, and don't use multidimensional arrays. Use a 1-dimensional array and create an accessor function that will access element (i,j) in the 1 dimensional array. That way you make everything the fastest. Also keep in mind that more threads doesn't mean faster results. Start with 1 thread, and keep increasing the number and study the speed as a function of the number of cores, and try to understand the results. – The Quantum Physicist Aug 14 '17 at 08:30
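The 1-D storage idea from the last comment can be sketched as follows. This is a minimal illustration, not code from the post; the `FlatMatrix` name and `at()` accessor are invented for the example:

```cpp
#include <cstddef>
#include <vector>

// Keep the whole matrix in one contiguous std::vector and map the
// 2-D index (i, j) to the flat index i * cols + j. A single
// allocation avoids pointer chasing and keeps rows adjacent in memory.
struct FlatMatrix {
    std::size_t rows, cols;
    std::vector<int> data;

    FlatMatrix(std::size_t r, std::size_t c)
        : rows(r), cols(c), data(r * c, 0) {}

    // Accessor for element (i, j) in the 1-D buffer.
    int& at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
};

// Usage: FlatMatrix m(3, 4); m.at(1, 2) = 7;  // flat index 1*4+2 == 6
```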
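Regarding the duplicate links above: `clock()` measures CPU time, which on Linux is summed over all threads, so a parallel run can report a *larger* time than a sequential one even when it finishes sooner. Measure wall-clock time instead; with OpenMP enabled, `omp_get_wtime()` does this. The sketch below uses `std::chrono` so it compiles without `-fopenmp`; the `wall_seconds` helper name is illustrative:

```cpp
#include <chrono>

// Returns seconds of wall-clock time elapsed since the first call.
// steady_clock is monotonic, so the difference is a valid duration.
double wall_seconds() {
    using clk = std::chrono::steady_clock;
    static const clk::time_point start = clk::now();
    return std::chrono::duration<double>(clk::now() - start).count();
}

// Usage:
//   double t0 = wall_seconds();
//   /* ... parallel work ... */
//   double elapsed = wall_seconds() - t0;
```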

1 Answer


There are many highly optimized linear algebra libraries that are free to use. I strongly suggest you use one of them whenever possible.

Your performance degradation may have several causes. The following list details some of the most common ones:

  • Use of schedule(dynamic) when the amount of work per iteration is completely balanced. Omitting the clause will set the schedule to static, which is more appropriate for this type of parallelization.

  • Excessive pressure on the memory allocation. You don't actually need to reserve multiple memory regions for a single matrix. Since the matrix size does not change in your program, you can perfectly use a single allocation for each matrix. This also improves data locality, as contiguous rows are close to each other in memory. Then, you can access each element using A[ i * b + j ], where b is the number of columns.

int *A = (int *) malloc( a * b * sizeof(int) );
  • In your code, you seem to have missed the enclosing parallel region. As a consequence, all of the omp for constructs except the last one are not executed by multiple threads: an orphaned omp for outside a parallel region simply runs sequentially.

  • Merge your omp for constructs in nested loops using collapse(2) as in the following example:

#pragma omp for collapse(2)
for( i = 0; i < a; i++ ) {
   for( j = 0; j < b; j++ ) {
      // your parallel code
   }
}
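Putting these points together, the multiplication loop could look like the sketch below: flat 1-D row-major storage, a single parallel region, the default (static) schedule, and collapse(2) on the two outer loops. The function name `multiply` is illustrative, not from the original post:

```cpp
#include <cstddef>
#include <vector>

// Multiply an a x b matrix A by a b x c matrix B, both stored as flat
// row-major vectors (element (i, j) of A lives at A[i * b + j]).
// collapse(2) merges the i/j2 loops into one iteration space that is
// divided among threads with the default (static) schedule; sum is
// declared inside the loop, so it is private to each iteration and
// no reduction clause is needed.
std::vector<int> multiply(const std::vector<int>& A,
                          const std::vector<int>& B,
                          int a, int b, int c) {
    std::vector<int> C(static_cast<std::size_t>(a) * c, 0);
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < a; i++) {
        for (int j2 = 0; j2 < c; j2++) {
            int sum = 0;
            for (int j = 0; j < b; j++)
                sum += A[i * b + j] * B[j * c + j2];
            C[i * c + j2] = sum;
        }
    }
    return C;
}
```

Compiled without `-fopenmp` the pragma is ignored and the function still produces the same result sequentially, which makes it easy to compare the two versions.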
Jorge Bellon