
I am trying to learn multi-threaded programming using OpenMP.

To begin with, I was testing a nested loop with a large number of array accesses and then parallelizing it; I am attaching the code below. Basically, I have a fairly large array `tmp` in the inner loop. If I make it shared so that every thread can access and change it, my code actually slows down as the number of threads increases, even though I have written it so that every thread writes exactly the same values to `tmp`. When I make `tmp` private, I get a speed-up proportional to the number of threads. The number of operations seems to me to be exactly the same in both cases. Why does it slow down when `tmp` is shared? Is it because different threads try to access the same address at the same time?

#include <stdio.h>
#include <time.h>
#include <omp.h>

int main(){
    int k,m,n,dummy_cntr=5000,nthread=10,id;
    long num=10000000;
    double x[num],tmp[dummy_cntr];
    double tm,fact;
    clock_t st,fn;

    st=clock();
    omp_set_num_threads(nthread);
#pragma omp parallel private(tmp)
    {
        id = omp_get_thread_num();
        printf("Thread no. %d \n",id);
#pragma omp for
        for (k=0; k<num; k++){
            x[k]=k+1;
            for (m=0; m<dummy_cntr; m++){
                tmp[m] = m;
            }
        }
    }
    fn=clock();
    tm=(fn-st)/CLOCKS_PER_SEC;
}

P.S.: I am aware that using clock() here doesn't really give the correct time: it adds up CPU time over all threads, so I have to divide it by the number of threads in this case to get output similar to what "time ./a.out" reports.
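
For reference, a minimal sketch (the busy loop is only placeholder work, not the code above) contrasting the two clocks: clock() adds up CPU time across all threads, while omp_get_wtime() measures wall time, which is what "time ./a.out" reports.

#include <stdio.h>
#include <time.h>
#include <omp.h>

int main(void){
    clock_t cs;
    double ws, s;
    long i;

    cs = clock();          /* CPU time, summed over all busy threads */
    ws = omp_get_wtime();  /* wall-clock time */

#pragma omp parallel private(i, s)
    {
        s = 0.0;
        for (i = 0; i < 100000000L; i++)    /* placeholder work */
            s += (double)i;
        if (s < 0.0) printf("never\n");     /* keeps the loop from being optimized away */
    }

    printf("cpu : %f s\n", (double)(clock() - cs) / CLOCKS_PER_SEC);
    printf("wall: %f s\n", omp_get_wtime() - ws);
    return 0;
}
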

  • Use `omp_get_wtime()` for a portable wall-time clock and read about things like [false sharing](http://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads) to understand why the slowdown happens. – Hristo Iliev Jun 21 '13 at 10:20
  • Or you could see my answer where I used `omp_get_wtime()` and fixed your false sharing problem in `tmp`. –  Jun 21 '13 at 13:22

2 Answers


This may be due to cache contention: if a part of the array is accessed by two or more threads, it will be cached multiple times, one copy per core. When one core needs to access data that another core has changed, it has to fetch the latest version from that core's cache, which takes some time.
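
For illustration, a minimal sketch of the effect; the thread count, the 64-byte cache-line size and the iteration count are assumptions rather than anything from the question. Each thread repeatedly updates one double: when the per-thread doubles sit packed in a single cache line, every write invalidates that line in the other cores' caches, whereas padding each counter out to its own line removes the contention.

#include <stdio.h>
#include <omp.h>

#define NTHREADS 4
#define PAD 8                /* 8 doubles = 64 bytes, an assumed cache-line size */
#define ITERS 100000000L

int main(void){
    /* volatile forces every update through memory so the cache-line traffic is visible */
    static volatile double packed[NTHREADS];        /* all counters share one cache line */
    static volatile double padded[NTHREADS][PAD];   /* one cache line per counter        */
    double t;
    long i;
    int id;

    omp_set_num_threads(NTHREADS);

    t = omp_get_wtime();
#pragma omp parallel private(id, i)
    {
        id = omp_get_thread_num();
        for (i = 0; i < ITERS; i++) packed[id] += 1.0;     /* false sharing */
    }
    printf("packed: %f s\n", omp_get_wtime() - t);

    t = omp_get_wtime();
#pragma omp parallel private(id, i)
    {
        id = omp_get_thread_num();
        for (i = 0; i < ITERS; i++) padded[id][0] += 1.0;  /* no false sharing */
    }
    printf("padded: %f s\n", omp_get_wtime() - t);

    return 0;
}

Making tmp private, or giving each thread its own heap buffer as in the answer below, has the same effect as the padding here.
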

    It's called _false sharing_ when two or more threads access array elements that fall in the same cache line and _true sharing_ if two or more threads access the same array element at the same time. – Hristo Iliev Jun 21 '13 at 10:17
  • Thanks @HristoIliev and Pragmateek. This was really helpful. – GreenEye Jun 21 '13 at 13:07

Your code has race conditions in `tmp` (when it is shared) and in `m`. I don't know what you are really trying to do, but this link might be helpful: Fill histograms (array reduction) in parallel with OpenMP without using a critical section
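
For illustration, a minimal sketch of that array-reduction pattern; `hist`, `hist_all` and `NBINS` are made-up names and the binning rule is just a stand-in. Each thread fills its own row of a per-thread copy of the array, and a second parallel loop merges the rows, so no critical section is needed.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define NBINS 5000              /* assumed bin count, analogous to dummy_cntr */

int main(void){
    int nthreads, id, t, m;
    long num = 10000000L, k;
    double *hist, *hist_all;

    hist = (double*)calloc(NBINS, sizeof(double));            /* final result */

    #pragma omp parallel private(id, k, m, t)
    {
        #pragma omp single
        {
            nthreads = omp_get_num_threads();
            /* one row of bins per thread, so no two threads write the same row */
            hist_all = (double*)calloc((size_t)nthreads * NBINS, sizeof(double));
        }
        id = omp_get_thread_num();

        #pragma omp for
        for (k = 0; k < num; k++) {
            m = (int)(k % NBINS);                /* stand-in for the real binning rule */
            hist_all[(long)id * NBINS + m] += 1.0;
        }

        /* merge the per-thread rows; the bins are split among the threads */
        #pragma omp for
        for (m = 0; m < NBINS; m++)
            for (t = 0; t < nthreads; t++)
                hist[m] += hist_all[(long)t * NBINS + m];
    }

    printf("hist[0] = %f\n", hist[0]);
    free(hist_all);
    free(hist);
    return 0;
}
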

I tried cleaning up your code. This version allocates memory for `tmp` separately in each thread, which solves your problem with false sharing in `tmp`.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main() {
    int k,m,dummy_cntr=5000;
    long num=10000000;
    double *x, *tmp;
    double dtime;

    x = (double*)malloc(sizeof(double)*num);

    dtime = omp_get_wtime();
    #pragma omp parallel private(tmp, k, m)
    {
        /* each thread allocates and fills its own buffer, so there is no false sharing in tmp */
        tmp = (double*)malloc(sizeof(double)*dummy_cntr);
        #pragma omp for
        for (k=0; k<num; k++){
            x[k]=k+1;
            for (m=0; m<dummy_cntr; m++){
                tmp[m] = m;
            }
        }
        free(tmp);
    }
    dtime = omp_get_wtime() - dtime;
    printf("%f\n", dtime);
    free(x);
    return 0;
}

Compiled with

gcc -fopenmp -O3 -std=c89 -Wall -pedantic foo.c
  • Thanks. But do you need to declare k as private? I thought each thread automatically gets its own copy of the parallelized for-loop index. – GreenEye Jun 20 '13 at 10:57
  • See this example where the OP did not make the loop indices private and he got the wrong results. Read the comments and the answers. –  Jun 20 '13 at 12:22
  • Oops, I forgot the example. Here it is: http://stackoverflow.com/questions/16989917/parallelizing-with-openmp/17001908#17001908 –  Jun 20 '13 at 12:38
  • Loop counter variables of parallel loops have a predetermined data-sharing attribute of `private`. One can explicitly declare them as such if they come from an outer scope, but doing so is redundant. – Hristo Iliev Jun 21 '13 at 10:13
  • @HristoIliev, please explain then why the code in this example failed until the OP made the counter variables `i` and `j` private http://stackoverflow.com/questions/16989917/parallelizing-with-openmp/17001908#17001908 –  Jun 21 '13 at 10:22
  • Obviously it fails since the inner loop over `j` is not parallel, and neither is the second loop over `i`. In GreenEye's code the loop over `k` **is** parallel and hence `k` is predetermined to be private in the scope of the `for` work-sharing construct. – Hristo Iliev Jun 21 '13 at 10:33
  • @HristoIliev, I see what you mean. Thanks for the explanation. But am I correct then to say that GreenEye needs to make `m` private (which he did not do) since it's used in the inner loop which is not parallel? –  Jun 21 '13 at 11:57