I am trying to optimize the performance of a parallel for loop that has a reduction variable (called delta), and I am wondering how the reduction is handled under the hood by the OpenMP runtime.
Let's take the following piece of code as an example, where I simply declare delta as a reduction variable at the beginning of the loop:
#pragma omp parallel shared(delta, A, B, rows, colms) private(i, j)
{
    ...
    #pragma omp for reduction(+:delta)
    for (i = 1; i <= rows; i++) {
        for (j = 1; j <= colms; j++) {
            delta += fabs(A[i][j] - B[i][j]);
        }
    }
    ...
} // end of parallel region
I am wondering whether each thread takes a lock every time it accesses the delta variable during the calculation. Furthermore, could I improve performance by replacing delta with an array delta[number_of_threads], where each thread writes to a different position of the array during the calculation, and I sum up all the elements after the parallel region?