I am optimizing my code. First I parallelized it with OpenMP, then I enabled the optimization flags provided by the GNU GCC compiler. I also added an SSE instruction to compute the inverse square root. But I finally realized that the problem is the last operation: when each thread writes its result into the reduction variable, it takes ~80% of the time. Here is the parallel loop:
time(&t5);
#pragma omp parallel for shared(NTOT) private(dx,dy,d,H,V,E,F,G,K) reduction(+:dU)
for (j = 1; j <= NTOT; j++) {
    if (j == i) continue;                        /* skip the self-interaction */
    dx = (X[2*j-2] - X[2*i-2]) * a;
    dy = (X[2*j-1] - X[2*i-1]) * a;
    d  = rsqrtSSE(dx*dx + dy*dy);                /* 1/sqrt(dx^2 + dy^2) */
    H  = D*d*d*d;
    V  = dS[0]*spin[2*j-2] + dS[1]*spin[2*j-1];
    E  = dS[0]*dx + dS[1]*dy;
    F  = spin[2*j-2]*dx + spin[2*j-1]*dy;
    G  = -3*d*d*E*F;
    K  = H*(V + G);
    dU += K;                                     /* accumulated through the reduction */
}
time(&t6);
t_loop = difftime(t6, t5);
where rsqrtSSE() is a function based on the _mm_rsqrt_ps(__m128 X) intrinsic declared in xmmintrin.h. Is there a way to overcome this problem, or is it due to a bandwidth limitation?
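For readers unfamiliar with the intrinsic, a wrapper of this kind might look roughly like the minimal sketch below. The name rsqrt_sse_sketch, the use of the scalar _mm_rsqrt_ss variant, and the Newton-Raphson refinement step are all assumptions made for illustration; this is not the actual rsqrtSSE() used in the loop above.

```c
#include <xmmintrin.h>

/* Hypothetical stand-in for rsqrtSSE(); name and details are assumptions. */
static inline float rsqrt_sse_sketch(float x)
{
    /* Hardware estimate of 1/sqrt(x), accurate to roughly 12 bits. */
    float y = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));

    /* One Newton-Raphson step to refine the estimate:
       y <- y * (1.5 - 0.5 * x * y * y). */
    return y * (1.5f - 0.5f * x * y * y);
}
```

The raw estimate alone is faster but less accurate, so whether the refinement step is worth keeping depends on how much precision dU needs.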
I compile with: gcc -o prog prog.c -lm -fopenmp -O3 -ffast-math -march=native
Here is some information about my computer:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 69
Model name: Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz
Stepping: 1
CPU MHz: 849.382
CPU max MHz: 2600.0000
CPU min MHz: 800.0000
BogoMIPS: 4589.17
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
NUMA node0 CPU(s): 0-3
and with Turbo Boost:
CPU  Avg_MHz  %Busy   Bzy_MHz  TSC_MHz
-    2294     99.97   2300     2295
0    2295     100.00  2300     2295
1    2295     100.00  2300     2295
2    2292     99.87   2300     2295
3    2295     100.00  2300     2295