
I compile a Fortran 90 code with the mpif90 compiler using two different makefiles. The first one looks like this:

FC = mpif90
FFLAGS = -Wall -ffree-line-length-none 
FOPT = -O3

all: ParP2S.o ParP2S
ParP2S.o: ParP2S.f90
        $(FC) $(FFLAGS) $(FOPT) ParP2S.f90 -c
ParP2S: ParP2S.o
        $(FC) $(FFLAGS) $(FOPT) ParP2S.o -o ParP2S
clean: 
        rm -f *.o

The second makefile looks very similar; I just added the -fopenmp flag:

FC = mpif90
FFLAGS = -Wall -ffree-line-length-none -fopenmp
FOPT = -O3

all: ParP2S.o ParP2S
ParP2S.o: ParP2S.f90
        $(FC) $(FFLAGS) $(FOPT) ParP2S.f90 -c
ParP2S: ParP2S.o
        $(FC) $(FFLAGS) $(FOPT) ParP2S.o -o ParP2S
clean: 
        rm -f *.o

The second makefile is for a hybrid (MPI with OpenMP) version of the code. For now, I have exactly the same code but compiled with these different makefiles. In the second case, the code is more than 100 times slower. Any comments on what I am doing wrong?

edit 1: I am not running multi-threaded tasks. In fact, the code does not have any OpenMP directives; it is just the pure MPI code, simply compiled with a different makefile. Nevertheless, I did try running with the following settings (see below) and it didn't help.

export MV2_ENABLE_AFFINITY=0
export OMP_NUM_THREADS=1
export OMP_PROC_BIND=true
mpirun -np 2 ./ParP2S

edit 2: I am using gcc version 4.9.2 (I know there was a bug with vectorization with -fopenmp in an older version). I thought the inclusion of the -fopenmp flag could be inhibiting compiler optimizations; however, after reading the interesting discussion (May compiler optimizations be inhibited by multi-threading?) I am not sure if this is the case. Furthermore, as my code does not have any OpenMP directives, I don't see why the code compiled with -fopenmp should be that much slower.

edit 3: When I run without -fopenmp (first makefile) it takes about 0.2 seconds without optimizations (-O0) and 0.08 seconds with optimizations (-O3), but including the flag -fopenmp it takes about 11 seconds with either -O3 or -O0.
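For reference, the timings above were measured along these lines, and the per-core load can be watched while the job runs (a minimal sketch, assuming a Linux machine where top is available; the -np 2 launch matches the commands above):

# time the pure-MPI build and the -fopenmp build the same way
time mpirun -np 2 ./ParP2S

# in a second terminal, watch the per-core load while the job runs
# (press '1' inside top to show one line per core)
top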

  • Please edit your question and add how you start your app (you are comparing flat n MPI tasks with hybrid NxM run). If you are running more than one OpenMP thread per task, it is critical the MPI task is **not** pinned on a single core. FWIW, `mpirun --oversubscribe ...` is worth trying on Open MPI. – Gilles Gouaillardet Oct 05 '19 at 01:29
  • Check which core each thread is running on. Try to check the core occupation using `top`. – Vladimir F Героям слава Oct 05 '19 at 21:20
  • I am not using multi-threading; there are no OpenMP directives in my code. – MCGuimaraes Oct 05 '19 at 21:38
  • Unless you can post a [MCVE], I suggest you try another compiler and/or profile your app to figure out where the time is spent in OpenMP “mode”. If you can run your program with one MPI task and without `mpirun`, then you might want to try that as well. – Gilles Gouaillardet Oct 06 '19 at 01:53

1 Answer


It turned out that the problem was really task affinity, as suggested by Vladimir F and Gilles Gouaillardet (thank you very much!).

First, I realized I was running Open MPI version 1.6.4 and not MVAPICH2, so the command export MV2_ENABLE_AFFINITY=0 has no effect here. Second, I was (presumably) taking care of the affinity of the different OpenMP threads by setting

export OMP_PROC_BIND=true
export OMP_PLACES=cores

but I was not setting the correct bindings for the MPI processes, as I was incorrectly launching the application as

mpirun -np 2 ./ParP2S

and it seems that, with Open MPI version 1.6.4, a more appropriate way to do it is

mpirun -np 2 -bind-to-core -bycore -cpus-per-proc 2  ./hParP2S

The options -bind-to-core -bycore -cpus-per-proc 2 bind each MPI task to two cores (see https://www.open-mpi.org/doc/v1.6/man1/mpirun.1.php and also Gilles Gouaillardet's comments on Ensure hybrid MPI / OpenMP runs each OpenMP thread on a different core). Without them, both MPI processes ended up on one single core, which was the reason for the poor performance of the code when the flag -fopenmp was used in the Makefile.
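A quick way to check that the bindings are what you expect is to ask mpirun to report them (a sketch; -report-bindings is listed in the same Open MPI 1.6 mpirun man page):

# print the core(s) each MPI task is bound to when it is launched
mpirun -np 2 -report-bindings -bind-to-core -bycore -cpus-per-proc 2 ./hParP2S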

Apparently, when running pure MPI code compiled without the -fopenmp flag, the MPI processes are automatically placed on different cores, but with -fopenmp one needs to specify the bindings explicitly as described above.
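Another way to see this, without touching the Fortran code, is to ask each launched process which cores it is allowed to run on (a sketch, assuming Linux's /proc and the OMPI_COMM_WORLD_RANK variable that Open MPI sets for its processes):

# without explicit bindings, both ranks report the same full (overlapping) core list,
# which is what allows the OpenMP runtime to pin them onto the same core;
# with -bind-to-core -bycore -cpus-per-proc 2, each rank reports its own cores
mpirun -np 2 sh -c 'echo "rank $OMPI_COMM_WORLD_RANK: $(grep Cpus_allowed_list /proc/self/status)"'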

For completeness, I should mention that there is no standard for setting task affinity, so this solution will not work with, e.g., MVAPICH2 or (possibly) different versions of Open MPI. Furthermore, running nproc MPI processes with nthreads OpenMP threads each on ncores cores would require, e.g.,

export OMP_PROC_BIND=true
export OMP_PLACES=cores
export OMP_NUM_THREADS=nthreads

mpirun -np nproc -bind-to-core -bycore -cpus-per-proc nthreads ./hParP2S

where the total number of cores used is ncores = nproc*nthreads (i.e., nthreads cores per MPI task).
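As a hypothetical example, 4 MPI tasks with 2 OpenMP threads each on a dedicated 8-core node would then be launched along these lines (the numbers are illustrative only):

# nproc = 4, nthreads = 2, ncores = 8
export OMP_PROC_BIND=true
export OMP_PLACES=cores
export OMP_NUM_THREADS=2

mpirun -np 4 -bind-to-core -bycore -cpus-per-proc 2 ./hParP2S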

PS: my code calls MPI_ALLTOALL. Having more than one MPI process on one single core (no hyperthreading) while calling this routine should be the reason why the code was about 100 times slower.

  • if MPI tasks are given overlapping sets of cores (in this case, all the cores), there is indeed the risk that the OpenMP runtime will end up pinning different MPI tasks to the same core (in this case, all MPI tasks are pinned on the first core). Binding an MPI+OpenMP app is really a two-step tango: have MPI assign non-overlapping cores to the MPI tasks, and then let the OpenMP runtime bind its threads within their assigned set of cores. – Gilles Gouaillardet Oct 07 '19 at 00:04