
My problem is a fluid flow simulation, but I will try to make the question as generic as possible. I have gone through the OpenMP API manual and OpenMP for F95, but as I am only five days into multithreading, I seek your help after being baffled by the smorgasbord of options for optimising the code. I am using an Intel Xeon CPU E5-2630 v4 @ 2.20GHz with one socket and 10 cores in that socket (20 logical CPUs with hyper-threading).

My whole simulation is basically filled with two kinds of nested loops as in (i) and (ii) below.

i) Loops where an array element (C(I,J,K) and D(I,J,K) below) depends on the value at grid point K-1, so I cannot parallelise the outermost loop, e.g.,

Nx=256, Ny=209, Nz=64

DO K = 2,NY-1
!$OMP PARALLEL DO 
  DO J = 1, NZ  
   DO I = 1, NX/2+1
    C(I,J,K) =  C(I,J,K)/(A(I,J,K)*C(I,J,K-1))
    D(I,J,K) =  (D(I,J,K)-D(I,J,K-1))/(C(I,J,K-1))
   END DO
  END DO
!$OMP END PARALLEL DO
END DO

A(:,:,1:NY) is already calculated in a different subroutine and hence is available as a shared variable to the OpenMP threads.

ii) Loops where the updated variable (A) does not depend on other grid points, so I can parallelise all the loops, like the following:

!$OMP PARALLEL DO
 DO K = 1, NY
  DO J=1,NZ
   DO I=1,NX
    A(I,J,K)=(B(I,J,K)-B(I,J,K-1))/C(K-1)
   END DO
  END DO
 END DO
!$OMP END PARALLEL DO


B(:,:,1:NY) and C(:,:,1:NY) are already calculated in a different subroutine

Question (a): Do the above nested loops have a race condition?

Question (b): The output is correct and matches the serial code, but:

b(i): are there any pitfalls in the code that could make it produce incorrect results in certain situations?

b(ii): can the output be correct with a race condition?

Question (c): Are there any ways to optimise this code further? There are many options in the above-mentioned manuals, but some help pointing me in the right direction would be highly appreciated.

I run the codes with

$ ulimit -s unlimited
$ export OMP_NUM_THREADS=16
$ gfortran -O3 mycode.f90 -fopenmp -o mycode

With 16 threads it takes about 80 time units, while with 6, 10 and 20 threads it takes 105, 101 and 100 time units respectively.

Question (d): I know there could be many reasons for the above, but is there a rule of thumb for choosing the right number of threads (other than trial and error, as somewhat implied in answers to this question)?

Question (e): Is ulimit -s unlimited a good option? (Without it I get a "segmentation fault (core dumped)" error.)

Thanks.

RTh
  • I do not see why you should get a stack-overflow error with your code when you do not have any private arrays. Anyway, I always recommend using allocatable arrays rather than playing with the stack size. They are so much better than static ones... https://stackoverflow.com/a/37403568/721644 – Vladimir F Героям слава Nov 29 '19 at 09:37
  • A second for Vladimir's comment about allocatable arrays. Also, it may be more performant in the first case to create the threads just once, outside the K loop, and then parallelise the J loop as you have (sketched below). As a newcomer to threading, please make sure you understand that thread creation and worksharing are separate phases; for this reason I don't teach my students the combined OMP constructs that you are using – in fact I wish OMP had never introduced them. – Ian Bush Nov 29 '19 at 09:51
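A minimal sketch of what this comment suggests for loop (i), using the array names and bounds from the question: the team of threads is created once, the K loop stays serial (and is executed redundantly by every thread), and only the J loop is workshared. The explicit PRIVATE clause is just for clarity; in OpenMP Fortran the indices of sequential DO loops inside a parallel construct are predetermined private anyway.

!$OMP PARALLEL PRIVATE(I, J, K)
DO K = 2, NY-1               ! K stays serial: slice K needs the finished slice K-1
!$OMP DO
  DO J = 1, NZ
   DO I = 1, NX/2+1
    C(I,J,K) = C(I,J,K)/(A(I,J,K)*C(I,J,K-1))
    D(I,J,K) = (D(I,J,K)-D(I,J,K-1))/(C(I,J,K-1))
   END DO
  END DO
!$OMP END DO                 ! implicit barrier: every thread finishes slice K
                             ! before any thread moves on to slice K+1
END DO
!$OMP END PARALLEL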

1 Answer

  • (a) You have a race condition only if multiple threads perform accesses to the same location (without synchronization) and at least one of those is a write.

    • The second code snippet does not have a race condition because you only write to each location of A exactly once and never read from it.

    • Similarly, the first code snippet does not have a race condition, as you only read/write from/to each location in the K slices of C and D once and then don't read it again within the same parallel section (because K is fixed within each parallel region). Reading from the K-1 slice is of course not a race. (For contrast, the first sketch after this list shows what parallelising K itself would do.)

  • (b)

    • (bi) Have you looked at the numerics? There seems to be a lot of room for catastrophic cancellation. But that's not threading-related.

    • (bii) Theoretically, yes. Really depends on how egregious it is. There is no race condition here though.

  • (c) Profile! Are you memory bound or CPU bound (presumably the former)? Do you get a lot of cache misses? Is there false sharing (I doubt it)? Can you rearrange your data and/or loops to improve cache behavior? Are your strided accesses cache-friendly? There are many of these kinds of performance gotchas; it will take time and experience to understand and recognize them. My advice would be to become particularly familiar with the impact of caching, as that is at the heart of most performance questions.

  • (d) If you care about performance, you have to profile and compare (a minimal timing sketch follows this list). Modern CPUs are so fiendishly complex that you have little chance of predicting the performance of any but the most trivial snippets. As mentioned in the answers you linked, if you are memory bound then more threads tend to make things worse, but your own results show that performance still improves with slightly more than one thread per physical core (presumably because the divisions are a bit slow?).

  • (e) That sounds like you are allocating large arrays on the stack. That itself is probably a bad idea precisely because you tend to run out of stack space. But I'm not familiar with Fortran best practices, so I can't help you much there (the last sketch below shows the allocatable alternative from the comments).
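For contrast with point (a), a sketch (not taken from the question) of what a genuine data race in snippet (i) would look like: if the K loop itself is parallelised, the thread working on slice K reads C(:,:,K-1) and D(:,:,K-1) while the thread that owns slice K-1 may still be writing them.

!$OMP PARALLEL DO
DO K = 2, NY-1                                        ! WRONG: the K iterations are not independent
  DO J = 1, NZ
   DO I = 1, NX/2+1
    C(I,J,K) = C(I,J,K)/(A(I,J,K)*C(I,J,K-1))         ! unsynchronised read of slice K-1 ...
    D(I,J,K) = (D(I,J,K)-D(I,J,K-1))/(C(I,J,K-1))     ! ... while another thread may be writing it: race
   END DO
  END DO
END DO
!$OMP END PARALLEL DO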
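For (c) and (d), the cheapest way to start profiling is to time the region of interest with omp_get_wtime and rerun with different OMP_NUM_THREADS. Below is a self-contained sketch built around the question's second snippet; the program name and the dummy fill values are mine, everything else is standard OpenMP.

PROGRAM time_loop
  USE omp_lib
  IMPLICIT NONE
  INTEGER, PARAMETER :: NX = 256, NY = 209, NZ = 64
  REAL, ALLOCATABLE :: A(:,:,:), B(:,:,:), C(:)
  DOUBLE PRECISION :: t0, t1
  INTEGER :: I, J, K

  ! Heap-allocated arrays; the sizes follow the question, the contents are dummies.
  ALLOCATE(A(NX,NZ,NY), B(NX,NZ,0:NY), C(0:NY))
  B = 1.0
  C = 2.0

  t0 = omp_get_wtime()
  !$OMP PARALLEL DO
  DO K = 1, NY
    DO J = 1, NZ
      DO I = 1, NX
        A(I,J,K) = (B(I,J,K) - B(I,J,K-1))/C(K-1)
      END DO
    END DO
  END DO
  !$OMP END PARALLEL DO
  t1 = omp_get_wtime()

  PRINT '(A,I0,A,F8.4,A)', 'threads=', omp_get_max_threads(), &
        '  elapsed=', t1 - t0, ' s'
END PROGRAM time_loop

Compile with the same flags as in the question (gfortran -O3 -fopenmp) and compare the elapsed time for, say, 1, 5, 10 and 20 threads; for a memory-bound loop like this, the scaling typically flattens once the memory bandwidth is saturated.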
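And for (e), a sketch of the allocatable-array suggestion from the comments above: an allocatable work array typically lives on the heap rather than the stack, so ulimit -s unlimited is no longer needed for it. The subroutine name and the array shape are illustrative, not taken from the question's code; an unsaved local allocatable is deallocated automatically on return.

SUBROUTINE build_work_array(NX, NY, NZ)
  IMPLICIT NONE
  INTEGER, INTENT(IN) :: NX, NY, NZ
  ! REAL :: WORK(NX/2+1, NZ, NY)      ! automatic array: lands on the stack
  REAL, ALLOCATABLE :: WORK(:,:,:)    ! allocatable array: lands on the heap
  ALLOCATE(WORK(NX/2+1, NZ, NY))
  WORK = 0.0                          ! ... fill and use WORK as before ...
  DEALLOCATE(WORK)                    ! optional; it is freed on return anyway
END SUBROUTINE build_work_array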

Max Langhof