
I have written code for matrix-vector multiplication. The matrix is divided into blocks of rows based on the number of threads, each block is multiplied by the vector, and the partial result is stored in an array private to the thread. But my speedup is very poor: for a matrix of size 16 x 16 it is below 1.

Can this be because I declare my matrix and vector outside the parallel region as shared variables, and that may be causing a race condition/false sharing when each thread tries to read values from the matrix and vector?

I am a bit confused about the difference between false sharing and a race condition.

#include <stdio.h>
#include <omp.h>
#include <stdlib.h>
#define  SIZE 128               // The size should be divisible by the number of threads

int main(int argc, char *argv[]) {

int thread_count = strtol(argv[1],NULL,10);
// Declare the variables
int i,j;
long A[SIZE][SIZE], b[SIZE],V[SIZE]={0};
//long Vect[SIZE]={0};
double start, end;
// Generate a matrix of size mxm
for (i=0; i<SIZE; i++)
{   for (j=0; j<SIZE; j++)
    A[i][j] = i+j;
}

printf("The Matrix is:\n");
// Print the Matrix
for (i=0; i<SIZE; i++)
{   for (j=0; j<SIZE; j++)
        {
        printf("%12ld", A[i][j]);
        }
printf("\n");

}

// Generate a vector of size m
for (i=0; i<SIZE; i++)
    b[i] = i;

printf("The vector is: \n");
// Print a vector
for (i=0; i<SIZE; i++)
    printf("%12ld\n", b[i]);


start = omp_get_wtime();
//omp_set_num_threads(NUM_THREADS);

#pragma omp parallel num_threads(thread_count)
{
int i,j,k, id, nthrds;
long Vect[SIZE]={0};
id = omp_get_thread_num();
nthrds = omp_get_num_threads(); 
for (i=id*SIZE/nthrds; i<(id*SIZE/nthrds + SIZE/nthrds); i++)
{   Vect[i] = 0;
    {
        for (j=0; j<SIZE; j++)
        Vect[i] += A[i][j]*b[j];
    }

}

#pragma omp critical
{
for (k=0; k<SIZE; k++)
V[k] += Vect[k]; 
}
}


end = omp_get_wtime();
printf("The vector obtained after multiplication is:\n");
for (i=0; i<SIZE; i++)
printf("%12ld\n", V[i]);
printf("The time taken for calculation is: %lf\n", end - start);


return 0;

}
  • It's very likely that for a workload that small (each thread is doing only 256/num_thread multiply-adds), the overhead of setting up thread-parallelization is greater than the speedup of that parallelization. And yes, having a shared write-state between the threads is very likely making the parallelization overhead even higher. – aruisdante Feb 24 '15 at 18:04
  • For more on false-sharing: http://stackoverflow.com/questions/9027653/openmp-false-sharing?rq=1. For some interesting discussion on general OpenMP performance: http://stackoverflow.com/questions/10939158/openmp-performance?rq=1 – aruisdante Feb 24 '15 at 18:10
  • @aruisdante there is no shared write, there is shared read – Divya Prakash Feb 24 '15 at 21:39
  • Could you use some indentation in your programs? I can't even see where the parallel region ends. – Vladimir F Героям слава Feb 24 '15 at 21:48
  • Before looking over your code, it's important to point out that matrix*vector is a memory-bandwidth-bound operation, so if you implement it right it won't scale well with multiple threads (that does not mean threading won't help some) on a single-socket system. – Z boson Feb 25 '15 at 07:49

1 Answer


Let me make a few suggestions to improve your code.

  1. It's rarely a good idea or necessary to parallelize a for loop by hand. One reason is that it's error-prone.

    for (i=id*SIZE/nthrds; i<(id*SIZE/nthrds + SIZE/nthrds); i++)
    

    should instead be

    for (i=id*SIZE/nthrds; i<(id+1)*SIZE/nthrds; i++)
    

    otherwise, because of integer division, some rows are never processed for values of nthrds that do not divide SIZE, and the result is wrong (see the range-printing sketch after this list).

    But instead of defining the chunks yourself let OpenMP do this for you.

    #pragma omp parallel for private(j)
    for(i=0; i<SIZE; i++) {
        long sum = 0;
        for(j=0; j<SIZE; j++) {
            sum += A[i][j]*b[j];
        }
        V[i] += sum;
    }
    
  2. You're right to worry about false sharing when writing to V. However, there is no need to define an array Vect for each thread. The code above addresses the false sharing you were concerned about by accumulating into a local sum inside the outer loop and writing to V only once per row. This code still has false sharing on V, but instead of occurring on every (i, j) iteration (SIZE*SIZE writes) it occurs only once per i iteration (SIZE writes).

  3. A SIZE of 128 is too small to overcome the OpenMP overhead. When I use a size of 8192 I see a significant improvement over the serial code. But your code has another problem at large sizes: you used automatic variables for your arrays, and automatic storage is limited by the stack size. I suggest you use static variables, which are not limited by the stack size (heap allocation works as well; see the sketch after the code below).

  4. Lastly, it's not fair to compare against serial code by running the OpenMP build with num_threads(1). The reason is that the compiler builds in OpenMP support even for num_threads(1), which biases the result. Instead you should compare with and without OpenMP enabled. Unfortunately, GCC does not allow you to use omp_get_wtime() without enabling OpenMP (though MSVC and ICC do). So if you're using GCC, comment out the pragma when comparing with the serial code. With ICC you can enable only the stub functions. With MSVC, don't enable OpenMP (omp_get_wtime() still works).
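
To make the loop-bound issue in point 1 concrete, here is a small standalone sketch (not part of the original question or answer) that prints the row range each thread gets with the original bound versus the corrected bound, assuming SIZE = 128 and three threads:

#include <stdio.h>
#define  SIZE 128

int main(void) {
    int id, nthrds = 3;                 /* 128 is not divisible by 3 */
    for (id = 0; id < nthrds; id++) {
        int start     = id*SIZE/nthrds;
        int orig_end  = id*SIZE/nthrds + SIZE/nthrds;  /* bound from the question */
        int fixed_end = (id+1)*SIZE/nthrds;            /* corrected bound */
        printf("thread %d: original [%d, %d), corrected [%d, %d)\n",
               id, start, orig_end, start, fixed_end);
    }
    return 0;
}

Because of integer division the original bound gives the ranges [0,42), [42,84) and [85,127), so rows 84 and 127 are never computed, while the corrected bound covers [0,42), [42,85) and [85,128).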

Here is code that addresses each of these points:

#include <stdio.h>
#include <omp.h>
#define  SIZE 8192

int main(void) {
    int i,j;
    double dtime;
    static long A[SIZE][SIZE], b[SIZE],V[SIZE];
    for (i=0; i<SIZE; i++) {
        for (j=0; j<SIZE; j++) {
            A[i][j] = i+j;
        }
    }
    for (i=0; i<SIZE; i++) b[i] = i;

    dtime = -omp_get_wtime();
    #pragma omp parallel for private(j) //comment out for one thread
    for(i=0; i<SIZE; i++) {
        long sum = 0;
        for(j=0; j<SIZE; j++) {
            sum += A[i][j]*b[j];
        }
        V[i] += sum;
    }    
    dtime += omp_get_wtime();
    printf("The time taken for calculation is: %lf\n", dtime);

    return 0;
}
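
If the static arrays from point 3 are undesirable, the same loop also works with heap allocation; the following is only a minimal malloc-based sketch (not part of the original answer), using the same SIZE and timing as above:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define  SIZE 8192

int main(void) {
    int i,j;
    double dtime;
    /* Heap allocation avoids the stack-size limit; about 0.5 GiB for SIZE = 8192. */
    long (*A)[SIZE] = malloc(sizeof(long[SIZE][SIZE]));
    long *b = malloc(SIZE*sizeof(long));
    long *V = calloc(SIZE, sizeof(long));
    if (A == NULL || b == NULL || V == NULL) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    for (i=0; i<SIZE; i++) {
        b[i] = i;
        for (j=0; j<SIZE; j++) {
            A[i][j] = i+j;
        }
    }

    dtime = -omp_get_wtime();
    #pragma omp parallel for private(j) //comment out for one thread
    for (i=0; i<SIZE; i++) {
        long sum = 0;
        for (j=0; j<SIZE; j++) {
            sum += A[i][j]*b[j];
        }
        V[i] += sum;
    }
    dtime += omp_get_wtime();
    printf("The time taken for calculation is: %lf\n", dtime);

    free(A); free(b); free(V);
    return 0;
}

Whether static or heap allocation is used is mostly a matter of convenience here; both avoid the automatic (stack) storage that overflows for large SIZE.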
Z boson