
I am doing some optimization on my code. First I parallelized it with OpenMP, then I used the optimization flags provided by the GNU GCC compiler, and I also included an SSE instruction to compute the inverse square root. But I finally realized that the problem is the last operation, where each thread writes its result into the reduction variable: it takes ~80% of the time. Here is the parallel loop:

time(&t5);
# pragma omp parallel for shared(NTOT) private(dx,dy,d,H,V,E,F,G,K) reduction(+:dU)
for(j = 1; j <= NTOT; j++){
  if(!(j-i)) continue;
  dx = (X[2*j-2]-X[2*i-2])*a;
  dy = (X[2*j-1]-X[2*i-1])*a;
  d = rsqrtSSE(dx*dx+dy*dy);
  H = D*d*d*d;
  V = dS[0]*spin[2*j-2]+dS[1]*spin[2*j-1];
  E = dS[0]*dx+dS[1]*dy;
  F = spin[2*j-2]*dx+spin[2*j-1]*dy;
  G = -3*d*d*E*F;
  K = H*(V+G);
  dU += K;
}
time(&t6);
t_loop = difftime(t6, t5);

where rsqrtSSE() is a function based on the _mm_rsqrt_ps(__m128 X) intrinsic declared in xmmintrin.h. Is there a solution to overcome this problem, or is this due to a bandwidth limitation?

I compile with gcc -o prog prog.c -lm -fopenmp -O3 -ffast-math -march=native

Here is some info about my computer:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    2
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 69
Model name:            Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz
Stepping:              1
CPU MHz:               849.382
CPU max MHz:           2600.0000
CPU min MHz:           800.0000
BogoMIPS:              4589.17
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              3072K
NUMA node0 CPU(s):     0-3

and with turbo boost:

CPU  Avg_MHz  %Busy   Bzy_MHz  TSC_MHz
-    2294     99.97   2300     2295
0    2295     100.00  2300     2295
1    2295     100.00  2300     2295
2    2292     99.87   2300     2295
3    2295     100.00  2300     2295

Rami Zouari
  • A few things are missing from this otherwise very good question. How did you determine that the last line consumes 80% of the time? What does time actually mean considering multiple threads? What is the performance of one thread vs. two threads on your dual-core system? How large is `NTOT`? As usual, a [mcve] would help a lot to address your question practically. – Zulan Jun 01 '16 at 17:41
  • I determined that using the time_t type (writing the difference in time). NTOT = 6400, and there is another loop whose size is 6400×1000. The time elapsed in this loop is 160 seconds with dU += K and 30 seconds without it. I can't post the whole code; it's 372 lines. I can use 4 threads because it's two threads per core. Of course 2 is better than 1 and 4 better than 2. – Rami Zouari Jun 01 '16 at 17:45
  • Edit: it's less than 10 seconds without dU += K – Rami Zouari Jun 01 '16 at 17:55
  • 1
    That is not a proper way to time that. Removing `dU += K` allows the compiler to optimize away the entire computation because the result is not used. That is certainly what is happening here. The four hardware threads of your CPU are not independant, each two threads share a core. You can experiment with 2/4 threads, but do not expect a speedup - it may be slightly slower / faster / equal either way. – Zulan Jun 01 '16 at 17:59
  • I measured the time of dU += K just now and it's about 1/4 of the time of the whole loop. Of course the code becomes slower because of the timing instructions, but it's still a big difference! About 4 or 2 threads: I experimented with that, and it's faster with 4 threads than with 2. I think we call it hyper-threading, when we have two cores and 2 threads per core. – Rami Zouari Jun 01 '16 at 18:16

1 Answer


Your measurement is flawed.

The first approach, removing the line altogether, allows the compiler to optimize away most of the computation, simply because the result is unused.

The second approach, if I understand it correctly, was to place timing instructions inside the loop itself, e.g. before and after dU += K. Unfortunately, this also has no hope of producing meaningful results: any library timing call is orders of magnitude slower than this operation, so you basically measure the time it takes to get the time. You can verify that by making repeated timing calls and comparing the differences.

From what I have seen, I suspect that the OpenMP implementation simply keeps dU as a thread-private variable and only after the loop completes performs the reduction with atomic/locked operations.

A better approach to determining the performance of individual lines is sampling. Take a look at the perf tool on Linux; it can be very helpful. Even then, the results may not be perfectly reliable, because there can be bias: due to compiler optimizations (e.g. unrolling) and hardware optimizations (e.g. pipelining), multiple lines of code can be in execution at the same time, so it becomes very difficult to say how much time a single line of code takes.

One approach to see whether you actually have a problem is to work out the theoretical hardware performance; for _mm_rsqrt_ps that seems difficult. You can also look at hardware performance counters to see whether you have a lot of cache misses etc. perf can help you with that as well.

Zulan