
I need to implement a reduction operation (each thread should store its value in a different array entry). However, it runs slower with more threads. Any suggestions?

double local_sum[16];
//Initializations....
#pragma omp parallel for shared(h,n,a) private(x, thread_id) 
for (i = 1; i < n; i++) {
    thread_id = omp_get_thread_num();
    x = a + i * h;
    local_sum[thread_id] += f(x);
}
romants

2 Answers


You are experiencing the effects of false sharing. On x86 a single cache line is 64 bytes long and therefore holds 64 / sizeof(double) = 8 array elements. When one thread updates its element, the core that it runs on uses the cache coherency protocol to invalidate the same cache line in all other cores. When another thread then updates its element, instead of operating directly on its cached copy, its core has to reload the cache line from a higher-level data cache or from main memory. This significantly slows down the program execution.

The simplest solution is to insert padding and thus spread the array elements accessed by different threads into distinct cache lines. On x86 that means padding by 7 double elements, i.e. letting each thread use only every 8th array slot. Therefore your code should look like:

double local_sum[8*16];
//Initializations....
#pragma omp parallel for shared(h,n,a) private(x, thread_id) 
for (i = 1; i < n; i++) {
    thread_id = omp_get_thread_num();
    x = a + i * h;
    local_sum[8*thread_id] += f(x);
}

Don't forget to take only each 8th element when summing the array at the end (or initialise all array elements to zero).

Hristo Iliev
  • Wouldn't it be better to keep the original 16-element array and instead use a private local partial sum in the parallel loop, then fill the 16-element array outside of the parallel loop but inside the parallel block? This still has false sharing, but the impact is negligible since the array is written only once per thread rather than once per iteration, and in addition you don't have to worry about pages on NUMA systems. – Z boson Feb 04 '14 at 12:58
  • 1
    It would be better but it won't have the educational value of teaching the OP about false sharing. NUMA related optimisations come later :) – Hristo Iliev Feb 04 '14 at 13:26

Did you try using a reduction?

double global_sum = 0.0;
#pragma omp parallel for shared(h,n,a) reduction(+:global_sum) 
for (i = 1; i < n; i++) {
    global_sum += f(a + i * h);
}

However, there may be a lot of other reasons why it runs slowly. For example, you should not create 16 threads if you have only 2 CPU cores, and so on.