My problem is a fluid flow simulation, but I will try to make the question as generic as possible. I have gone through the OpenMP API manual and OpenMP for F95. However, as I am only five days into multithreading, I am baffled by the smorgasbord of options for optimising the code and would appreciate your help. I am using an Intel Xeon CPU E5-2630 v4 @ 2.20GHz with one socket and 10 cores in that socket (20 logical CPUs with hyperthreading).
My whole simulation is basically filled with two kinds of nested loops as in (i) and (ii) below.
i) Where an array element (C(I,J,K) and D(I,J,K) below) depends on the previous K-1 grid point, and hence I can't parallelise the outermost loop, e.g.,
    NX = 256; NY = 209; NZ = 64
    DO K = 2, NY-1
       ! K carries a dependence on K-1, so only the J loop is shared among threads
       !$OMP PARALLEL DO
       DO J = 1, NZ
          DO I = 1, NX/2+1
             C(I,J,K) = C(I,J,K)/(A(I,J,K)*C(I,J,K-1))
             D(I,J,K) = (D(I,J,K)-D(I,J,K-1))/(C(I,J,K-1))
          END DO
       END DO
       !$OMP END PARALLEL DO
    END DO
A(:,:,1:NY) is already calculated in a different subroutine and hence is available as a shared variable to the OpenMP threads.
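For reference, here is a minimal, self-contained sketch of pattern (i) with the parallel region hoisted outside the K loop, so that the thread team is created once rather than once per K step. The small array sizes, the initial values, and the names c_ref and d_ref are assumptions made for illustration and are not taken from the actual simulation; the sketch only checks that this variant reproduces a serial sweep.

```fortran
program hoisted_region
  ! Sketch only: hypothetical sizes and made-up initial data.
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nx = 16, ny = 12, nz = 8
  real(dp), allocatable :: a(:,:,:), c(:,:,:), d(:,:,:)
  real(dp), allocatable :: c_ref(:,:,:), d_ref(:,:,:)
  integer :: i, j, k

  allocate(a(nx/2+1, nz, ny), c(nx/2+1, nz, ny), d(nx/2+1, nz, ny))

  ! Made-up data, chosen so that no divisor is ever zero.
  a = 2.0_dp
  do k = 1, ny
     do j = 1, nz
        do i = 1, nx/2 + 1
           c(i,j,k) = 1.0_dp + 0.01_dp * real(i + j + k, dp)
           d(i,j,k) = 0.5_dp * real(i - j + k, dp)
        end do
     end do
  end do

  ! Serial reference sweep on copies of C and D.
  c_ref = c
  d_ref = d
  do k = 2, ny - 1
     do j = 1, nz
        do i = 1, nx/2 + 1
           c_ref(i,j,k) = c_ref(i,j,k) / (a(i,j,k) * c_ref(i,j,k-1))
           d_ref(i,j,k) = (d_ref(i,j,k) - d_ref(i,j,k-1)) / c_ref(i,j,k-1)
        end do
     end do
  end do

  ! One parallel region around the whole K sweep. Every thread runs the
  ! K loop itself (K is private); the J iterations are divided among the
  ! threads, and the implicit barrier at END DO keeps the threads on the
  ! same K plane before anyone moves on to K+1.
  !$omp parallel default(none) shared(a, c, d) private(i, j, k)
  do k = 2, ny - 1
     !$omp do schedule(static)
     do j = 1, nz
        do i = 1, nx/2 + 1
           c(i,j,k) = c(i,j,k) / (a(i,j,k) * c(i,j,k-1))
           d(i,j,k) = (d(i,j,k) - d(i,j,k-1)) / c(i,j,k-1)
        end do
     end do
     !$omp end do
  end do
  !$omp end parallel

  print *, 'max |C - C_ref| = ', maxval(abs(c - c_ref))
  print *, 'max |D - D_ref| = ', maxval(abs(d - d_ref))
  if (max(maxval(abs(c - c_ref)), maxval(abs(d - d_ref))) > 1.0e-12_dp) then
     error stop 'parallel and serial sweeps differ'
  end if
end program hoisted_region
```

The sketch should build with the same command line as the main code (gfortran with -fopenmp). Without -fopenmp the directives are treated as comments and the program simply runs the sweep twice in serial.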
ii) Where the updated variable (A) does not depend on other grid points, and hence I can parallelise all the loops, like the following:
    !$OMP PARALLEL DO
    DO K = 1, NY
       DO J = 1, NZ
          DO I = 1, NX
             A(I,J,K) = (B(I,J,K)-B(I,J,K-1))/C(K-1)
          END DO
       END DO
    END DO
    !$OMP END PARALLEL DO
B(:,:,1:NY) and C(:,:,1:NY) are already calculated in a different subroutine.
Question (a): Do the above nested loops have a race condition?
Question (b): The output is correct and matches the serial code, but:
b(i): are there any pitfalls in the code that could make it work incorrectly in certain situations?
b(ii): can the output be correct with a race condition?
Question (c): Are there any ways to optimise this code further? There are many options in the above-mentioned manuals, but some pointers in the right direction would be highly appreciated.
I compile and run the code with

    $ ulimit -s unlimited
    $ export OMP_NUM_THREADS=16
    $ gfortran -O3 mycode.f90 -fopenmp -o mycode
With 16 threads it takes about 80 time units, while with 6, 10 and 20 threads it takes 105, 101 and 100 time units, respectively.
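For anyone who wants to reproduce this kind of measurement, the following is a minimal sketch of timing a loop of type (ii) with omp_get_wtime. The program name, the random data in B, the constant C, and the 0-based lower bound in K are assumptions made for illustration; the real simulation does more work per time step, so the numbers from this sketch are not comparable with the timings above.

```fortran
program time_loop
  ! Sketch only: array sizes follow the question, the data are made up.
  use omp_lib, only: omp_get_wtime, omp_get_max_threads
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nx = 256, ny = 209, nz = 64
  real(dp), allocatable :: a(:,:,:), b(:,:,:), c(:)
  real(dp) :: t0, t1
  integer :: i, j, k

  ! Assumption: B and C start at K = 0 so that K-1 stays in bounds at K = 1.
  allocate(a(nx, nz, ny), b(nx, nz, 0:ny), c(0:ny))
  call random_number(b)
  c = 1.0_dp

  t0 = omp_get_wtime()
  ! The K iterations are independent, so the outermost loop is shared.
  !$omp parallel do default(none) shared(a, b, c) private(i, j, k)
  do k = 1, ny
     do j = 1, nz
        do i = 1, nx
           a(i,j,k) = (b(i,j,k) - b(i,j,k-1)) / c(k-1)
        end do
     end do
  end do
  !$omp end parallel do
  t1 = omp_get_wtime()

  print '(a,i0,a,f10.6,a)', 'threads = ', omp_get_max_threads(), &
        ', elapsed = ', t1 - t0, ' s'
  ! Printing a checksum keeps the compiler from discarding the loop.
  print *, 'checksum = ', sum(a)
end program time_loop
```

This needs -fopenmp at compile time because it uses the omp_lib module. Running it with different values of OMP_NUM_THREADS gives one elapsed time per thread count for this loop in isolation.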
Question (d): I know there could be many reasons for the above, but is there a rule of thumb for choosing the right number of threads (other than trial and error, as somewhat implied in the answers to this question)?
Question (e): Is ulimit -s unlimited a good option? (Without it I get a segmentation fault (core dumped) error.)
Thanks.