20

I'm learning the basics of parallel execution of a for loop using OpenMP.

Sadly, my parallel program runs 10x slower than the serial version. What am I doing wrong? Am I missing some barriers?

double **basicMultiply(double **A, double **B, int size) {
   int i, j, k;
   double **res = createMatrix(size);
   omp_set_num_threads(4);
   #pragma omp parallel for private(k)
   for (i = 0; i < size; i++) {
      for (j = 0; j < size; j++) {
         for (k = 0; k < size; k++) {
            res[i][j] += A[i][k] * B[k][j];
         }
      }
   }
   return res;
}

Thank you very much!

rerx
Hynek Blaha
    For what value of `size` have you tried the code? Also you should mark both `k` and `j` private if you start specifying that for one of them. – rerx Mar 25 '14 at 12:19
  • What's the size of your matrix? – user2076694 Mar 25 '14 at 12:24
  • size = 512; I think it's big enough, isn't it? – Hynek Blaha Mar 25 '14 at 12:28
  • Note, you may want to add `collapse(3)` to your `parallel for` — otherwise only the outer loop is parallelized. – Wyzard Mar 25 '14 at 12:37
  • I tried it with different sizes but it's always slower. Any other idea? I'm confused, because this should be one of the simplest things... :O – Hynek Blaha Mar 25 '14 at 12:55
  • Like what size? Also, what do you get if you call `omp_get_max_threads()`? – rerx Mar 25 '14 at 13:12
  • The biggest size of matrices was 2048*2048. omp_get_max_threads() returns 16. – Hynek Blaha Mar 25 '14 at 13:22
  • Did you make your `j` and `k` variables private as @rerx said? – Hristo Iliev Mar 25 '14 at 14:11
  • Moreover, could you be more precise about how you do the timing? How many multiplications do you do? If you time just a single matrix product, I would guess that the threading overhead outweighs any gain. – rerx Mar 25 '14 at 14:16
  • I measure time with `clock_t start=clock(); clock_t final = clock()-start; printf("Time: %f\n",final/(double) CLOCKS_PER_SEC);`. I tried many sizes, but it always seems to be ten times slower. – Hynek Blaha Mar 25 '14 at 14:28
  • The problem is likely due to j not being made private. Another example of the bug/feature with OpenMP implicitly making the first loop variable private and people assuming this applies to inner loops as well. – Z boson Mar 25 '14 at 14:47
  • Since this is C++ you should use mixed declarations; then you would never have this problem: `for(int i=0...) for(int j=0...)`. – Z boson Mar 25 '14 at 14:52

3 Answers

34

Your problem is due to a race condition on the inner loop variable j. It needs to be made private.

For C89 I would do something like this:

#pragma omp parallel
{
    int i, j, k;
    #pragma omp for
    for(i=0; ...

For C++ or C99, use mixed declarations:

#pragma omp parallel for
for(int i=0; ...

Doing this, you don't have to explicitly declare anything shared or private.
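
Applied to the loop from the question, a minimal sketch of the same C89-style fix (keeping the variable names from the question) is simply to list both inner counters in the private clause:

#pragma omp parallel for private(j, k)
for (i = 0; i < size; i++) {
   for (j = 0; j < size; j++) {
      for (k = 0; k < size; k++) {
         res[i][j] += A[i][k] * B[k][j];
      }
   }
}

Only i is made private automatically by the parallel for; j and k must be listed explicitly (or declared inside the loops).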

Some further comments on your code. Your single-threaded code is not cache friendly when you access B[k][j]: each step of the inner loop jumps to a different cache line (stride-n access), and by the time the dot product is finished the earlier cache lines have been evicted. Instead you should take the transpose first and access it as BT[j][k]. Additionally, you have allocated an array of arrays rather than one contiguous 2D array. I fixed your code to use the transpose and a contiguous 2D array.

Here are the times I get for size=512.

no transpose, no openmp  0.94s
no transpose, openmp     0.23s
transpose, no openmp     0.27s
transpose, openmp        0.08s

Below is the code (also see http://coliru.stacked-crooked.com/a/ee174916fa035f97)

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

void transpose(double *A, double *B, int n) {
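    /* Writes B = transpose(A) for row-major n x n matrices stored as flat arrays. */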
    int i,j;
    for(i=0; i<n; i++) {
        for(j=0; j<n; j++) {
            B[j*n+i] = A[i*n+j];
        }
    }
}

void gemm(double *A, double *B, double *C, int n) 
{   
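    /* Naive triple-loop multiply, C = A*B, for row-major n x n matrices. */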
    int i, j, k;
    for (i = 0; i < n; i++) { 
        for (j = 0; j < n; j++) {
            double dot  = 0;
            for (k = 0; k < n; k++) {
                dot += A[i*n+k]*B[k*n+j];
            } 
            C[i*n+j ] = dot;
        }
    }
}

void gemm_omp(double *A, double *B, double *C, int n) 
{   
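    /* Same as gemm, but the outer i loop is parallelized with OpenMP; i, j and k are
       declared inside the parallel region, so each thread gets its own private copies. */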
    #pragma omp parallel
    {
        int i, j, k;
        #pragma omp for
        for (i = 0; i < n; i++) { 
            for (j = 0; j < n; j++) {
                double dot  = 0;
                for (k = 0; k < n; k++) {
                    dot += A[i*n+k]*B[k*n+j];
                } 
                C[i*n+j ] = dot;
            }
        }

    }
}

void gemmT(double *A, double *B, double *C, int n) 
{   
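    /* Multiplies against a transposed copy of B so that the inner k loop reads both
       operands contiguously (cache friendly). */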
    int i, j, k;
    double *B2;
    B2 = (double*)malloc(sizeof(double)*n*n);
    transpose(B,B2, n);
    for (i = 0; i < n; i++) { 
        for (j = 0; j < n; j++) {
            double dot  = 0;
            for (k = 0; k < n; k++) {
                dot += A[i*n+k]*B2[j*n+k];
            } 
            C[i*n+j ] = dot;
        }
    }
    free(B2);
}

void gemmT_omp(double *A, double *B, double *C, int n) 
{   
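    /* Transposed variant with the outer i loop parallelized. */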
    double *B2;
    B2 = (double*)malloc(sizeof(double)*n*n);
    transpose(B,B2, n);
    #pragma omp parallel
    {
        int i, j, k;
        #pragma omp for
        for (i = 0; i < n; i++) { 
            for (j = 0; j < n; j++) {
                double dot  = 0;
                for (k = 0; k < n; k++) {
                    dot += A[i*n+k]*B2[j*n+k];
                } 
                C[i*n+j ] = dot;
            }
        }

    }
    free(B2);
}

int main() {
    int i, n;
    double *A, *B, *C, dtime;

    n=512;
    A = (double*)malloc(sizeof(double)*n*n);
    B = (double*)malloc(sizeof(double)*n*n);
    C = (double*)malloc(sizeof(double)*n*n);
    for(i=0; i<n*n; i++) { A[i] = rand()/RAND_MAX; B[i] = rand()/RAND_MAX;}

    dtime = omp_get_wtime();
    gemm(A,B,C, n);
    dtime = omp_get_wtime() - dtime;
    printf("%f\n", dtime);

    dtime = omp_get_wtime();
    gemm_omp(A,B,C, n);
    dtime = omp_get_wtime() - dtime;
    printf("%f\n", dtime);

    dtime = omp_get_wtime();
    gemmT(A,B,C, n);
    dtime = omp_get_wtime() - dtime;
    printf("%f\n", dtime);

    dtime = omp_get_wtime();
    gemmT_omp(A,B,C, n);
    dtime = omp_get_wtime() - dtime;
    printf("%f\n", dtime);

    return 0;

}
Z boson
  • rand()/RAND_MAX is zero. – Kadir Jun 12 '16 at 12:15
  • @Kadir change it to `1.0*rand()/RAND_MAX`. – Z boson Jun 12 '16 at 13:31
  • @Zboson, Hi, I compared your code (compiled with g++) with Matlab. The output of yours was 0.457343, 0.161412, 0.281850 and 0.105735, but Matlab did the job in only 0.002953 seconds. Do you have any idea how I can reach the Matlab performance using C? Thanks. – user153245 Feb 19 '17 at 15:46
  • @user153245, yeah, you need to do loop tiling/blocking to make better use of the cache. If you do this you will probably get about 50% of Matlab. To do as well as Matlab is very hard though. – Z boson Feb 19 '17 at 19:53
  • Is there a good reason you're using `malloc()`? I'd expect to see `new[]` there (and matching `delete[]`, unless you want to be hostile to debugging). – Toby Speight Dec 20 '17 at 16:53
  • @TobySpeight no, I don't have a good reason for using malloc. – Z boson Dec 21 '17 at 13:21
5

In addition to "Z boson"'s answer: I have tested the C code on a laptop with an Intel i5 (2 physical cores, 4 logical ones). Unfortunately, the calculation speed is not very fast. For 2000x2000 random double matrices I obtained the following results (using VS 2010 with OpenMP 2.0):

Compiled for Win64: C = A*B, where A and B are matrices of size 2000x2000:

max number of threads = 4
Create random matrices: = 0.303555 s
no transpose, no openmp = 100.539924 s
no transpose, openmp = 47.876084 s
transpose, no openmp = 27.872169 s
transpose, openmp = 15.821010 s

Compiled for Win32: C = A*B, where A and B are matrices of size 2000x2000:

max number of threads = 4
Create random matrices: = 0.378804 s
no transpose, no openmp = 98.613992 s
no transpose, openmp = 48.233655 s
transpose, no openmp = 29.590350 s
transpose, openmp = 13.678097 s

Note that for "Hynek Blaha"'s code the calculation time on my system is 739.208 s (226.62 s with OpenMP)!

Whereas in Matlab x64:

n = 2000; 
A = rand(n); B = rand(n);

tic
C = A*B;
toc

the calculation time is 0.591440 seconds.

But using the openBLAS package I reached a speed of 0.377814 seconds (using minGW with OpenMP 4.0). The Armadillo package provides a simple way (in my opinion) to connect matrix operations with openBLAS (or other similar packages). In this case the code is:

#include <iostream>
#include <armadillo>
using namespace std;
using namespace arma;

int main(){
    int n = 2000;
    int N = 10; // number of repetitions
    wall_clock timer;

    arma_rng::set_seed_random();

    mat A(n, n, fill::randu), B(n, n, fill::randu);

    timer.tic();
    // repeat simulation N times
    for(int rep = 0; rep < N; rep++){
      mat C = A*B;
    }
    cout << timer.toc()/double(N) << "s" << endl;

    return 0;
}
Alexander Korovin
  • This is such a great example! I'm currently struggling with OpenMP; I experienced bad performance even just setting all the values of a large matrix. Could you have a look at my question? Any suggestion would be appreciated! http://stackoverflow.com/questions/40700927/improve-openmp-multi-threading-parallel-computing-efficiency-for-matrix-multi-d – Valar Morghulis Nov 20 '16 at 17:23
  • Just a small comment regarding **MATLAB** times. Since the beginning of this millennium, MATLAB incorporates MKL (LAPACK) for LA and matrix calculations. You can check the BLAS version of MATLAB with `version -blas`. – Alaroff Dec 21 '17 at 13:39
3

If size is small, the overhead of thread synchronization will overshadow any performance gain from the parallel computation.
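
A minimal sketch of how one could measure that overhead directly (not from the original answer; assumes OpenMP is enabled, e.g. with -fopenmp): it times empty parallel regions with omp_get_wtime(), so the printed number approximates the fork/join cost paid for every parallelized loop, independent of the matrix size.

#include <stdio.h>
#include <omp.h>

int main(void) {
    const int reps = 1000;
    double t = omp_get_wtime();
    for (int r = 0; r < reps; r++) {
        #pragma omp parallel
        {
            /* no work here: we only pay the thread fork/join overhead */
        }
    }
    t = omp_get_wtime() - t;
    printf("average parallel-region overhead: %g s\n", t / reps);
    return 0;
}

For the 512x512 case above this overhead is negligible compared to the work, but for very small matrices it can easily exceed the time spent in the loop itself.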

rerx