Calculating pi using Monte Carlo method in two ways in C++ OpenMP

Question

What method should be faster? First method is increment one variable for reduction:

#pragma omp parallel private(seed, x, y, i) reduction (+:counter)
{
    seed = 25234 + 17 * omp_get_thread_num();
    nproc = omp_get_thread_num();
    #pragma omp parallel for
    for(i=0; i<prec/8; i++){
        x = (double)rand_r(&seed) / RAND_MAX;
                y = (double)rand_r(&seed) / RAND_MAX;
        if(x*x+y*y<1){
            counter++;
        } 

}

And the second one is using table of increment variables per process and at the end, sum of elements in this table is a result:

#pragma omp parallel private(seed, x, y, i , nproc)
{
    seed = 25234 + 17 * omp_get_thread_num();
    nproc = omp_get_thread_num();
    #pragma omp parallel for
    for(i=0; i<prec/8; i++){
        x = (double)rand_r(&seed) / RAND_MAX;
        y = (double)rand_r(&seed) / RAND_MAX;
        if(x*x+y*y<1){
            counter[nproc]++;
        } 

    }
}

double time = omp_get_wtime() - start_time;
int sum=0;
for(int i=0; i<8; i++){
    sum+=counter[i];

}

Theoretically, the second way should be faster, because processes are not sharing one variable, but every process has own variable. But when I calculate time of execution:

first approach: 3.72423 [s]

second approach: 8.94479[s]

Am I think wrong or am I do something wrong in my code?

`reduction` variables **are** private, which invalidates your argument. — Hristo Iliev, Nov 24 '15 at 10:36
I don't think this should be a duplicate the question is not the same even if the answer has the same conclusion. [exemple here](http://stackoverflow.com/questions/10315041/meaning-of-acronym-sso-in-the-context-of-stdstring) (surely not the best) — coincoin, Nov 24 '15 at 12:34
@coincoin, while it is not word-for-word the same question, the core premises are the same. Of course, anyone on SO is free to disagree and vote for reopening of the question. — Hristo Iliev, Nov 24 '15 at 16:06
@HristoIliev I don't really mind personnally it's fine ! Just by curiosity how can we vote for reopening the question ? I can't see that I guess I there is a minimum amount of points to have right ? — coincoin, Nov 24 '15 at 16:11
@coincoin, I really meant _almost_ anyone. You need 3000 points to be able to access the close/reopen voting - see [here](http://stackoverflow.com/help/privileges). — Hristo Iliev, Nov 24 '15 at 16:15

score 5 · Accepted Answer · answered Nov 24 '15 at 10:23

You are victim of false sharing in the second approach. Here an interesting article from Intel about that.

False sharing occurs when threads on different processors modify variables that reside on the same cache line. This invalidates the cache line and forces a memory update to maintain cache coherency.

If two processors operate on independent data in the same memory address region storable in a single line, the cache coherency mechanisms in the system may force the whole line across the bus or interconnect with every data write, forcing memory stalls in addition to wasting system bandwidth

Intuitively, I don't think the first approach should be slower.
You indeed create a private copy on each thread and then apply the final result in a global variable. The behaviour is somehow the same with your shared array, but the problem here is even if your accesses are independent, you get false sharing.

Calculating pi using Monte Carlo method in two ways in C++ OpenMP

1 Answers1