I am still confused: if I use the reduction clause in OpenMP, can false sharing happen? (Both code snippets below give the correct result.)

A little example where the maximum of an array is wanted:

double max_red(double *A, int N){
    // lowest(), not min(): min() is the smallest *positive* double,
    // which would give a wrong result for all-negative arrays
    double mx = std::numeric_limits<double>::lowest();
    #pragma omp parallel for reduction(max:mx)
    for(int i=0; i<N; ++i){
        if(A[i]>mx) mx = A[i];
    }
    return mx;
}

This example can also be written with extra padding, so that each thread's partial maximum lives on its own cache line:

double max_padd(double *A, int N){
    omp_set_num_threads(NUM_THREADS);
    // 8 doubles = 64 bytes, i.e. one slot per cache line
    double local_max[NUM_THREADS][8];
    double res;

    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        local_max[id][0] = std::numeric_limits<double>::lowest();

        #pragma omp for
        for(int i=0; i<N; ++i){
            if(A[i]>local_max[id][0]) local_max[id][0] = A[i];
        }
        #pragma omp single
        {
            res = local_max[0][0];
            for(int i=0; i<NUM_THREADS; ++i){
                if(local_max[i][0] > res) res = local_max[i][0];
            }
        }
    }
    return res;
}

But is the extra padding necessary to completely prevent false sharing, or is the reduction clause safe enough?

Thanks

Suslik

The padding is not necessary.

Technically this is not mandated by the standard: the standard doesn't say where each thread's private copy is to be located in memory. Remember that false sharing is not a correctness issue, only a (potentially very significant) performance issue.

However, it would be extremely surprising if any OpenMP implementation would make such a rookie mistake and place the private copies on the same cache line.

Assume that the implementation has a better understanding of the platform and its performance characteristics than the programmer. Only write manual "performance improvements" such as your second solution if you have proof by measurement that the idiomatic solution (i.e. your first solution) has bad performance that cannot be fixed by tuning.

Practical note: I'm fairly sure implementations typically place the private copy on the (private) stack of the executing thread; afterwards each thread updates the shared variable inside a critical section, or atomically where supported.
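For illustration, a hand-written version of that strategy might look like the sketch below. The function name `max_manual` is mine, and this is only a plausible rendering of what a compiler could emit for `reduction(max:mx)`, not what any particular implementation actually generates:

```cpp
#include <limits>

// Sketch: each thread accumulates into a stack-local variable
// (so false sharing cannot occur), then merges its partial result
// into the shared variable inside a critical section.
double max_manual(const double *A, int N) {
    double mx = std::numeric_limits<double>::lowest();
    #pragma omp parallel
    {
        // thread-private partial maximum, lives on this thread's stack
        double local = std::numeric_limits<double>::lowest();
        #pragma omp for nowait
        for (int i = 0; i < N; ++i)
            if (A[i] > local) local = A[i];
        // merge: one short critical section per thread, not per element
        #pragma omp critical
        if (local > mx) mx = local;
    }
    return mx;
}
```

Note the merge cost is one critical section per thread, independent of N, which is why this pattern scales well.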

In theory, a logarithmic-time tree-based reduction would be possible. However, implementations don't seem to do that, at least not in all cases.
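For completeness, a manual tree-style combine could look roughly like this (again a sketch with invented names, not what real implementations emit). The partial-results array itself is padded to one cache line per thread, since otherwise the combine phase would reintroduce false sharing:

```cpp
#include <limits>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
// serial fallback so the sketch also compiles without OpenMP
static int omp_get_num_threads() { return 1; }
static int omp_get_thread_num()  { return 0; }
#endif

// Sketch of a tree-based max reduction: partial results are merged
// pairwise, so the combine phase takes log2(T) rounds instead of T.
double max_tree(const double *A, int N) {
    const int PAD = 8;                    // 8 doubles = one 64-byte cache line
    int T = 1;
    std::vector<double> part;
    #pragma omp parallel
    {
        #pragma omp single
        {
            T = omp_get_num_threads();
            part.assign(T * PAD, std::numeric_limits<double>::lowest());
        }   // implicit barrier: part is ready before any thread writes it
        int id = omp_get_thread_num();
        double local = std::numeric_limits<double>::lowest();
        #pragma omp for
        for (int i = 0; i < N; ++i)
            if (A[i] > local) local = A[i];
        part[id * PAD] = local;
        #pragma omp barrier
        // pairwise combine: the stride doubles each round
        for (int stride = 1; stride < T; stride *= 2) {
            if (id % (2 * stride) == 0 && id + stride < T)
                if (part[(id + stride) * PAD] > part[id * PAD])
                    part[id * PAD] = part[(id + stride) * PAD];
            #pragma omp barrier
        }
    }
    return part[0];
}
```

For a max over a handful of threads the critical-section merge above is already cheap, which is presumably why implementations rarely bother with the tree.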

Zulan
  • Zulan gave more than our money's worth in this answer. I would add that tree reduction for both parallel and simd floating point addition is in some widely used implementations but may not be carried over to max/min reduction. – tim18 Apr 04 '18 at 12:49
  • Some compilers may optimize simd reduction better with std::max (or min). I have seen llvm do this better than g++ or icpc. – tim18 Apr 04 '18 at 12:54