I compile a Fortran 90 code with the mpif90 compiler using two different makefiles. The first one looks like:
FC = mpif90
FFLAGS = -Wall -ffree-line-length-none
FOPT = -O3
all: ParP2S.o ParP2S

ParP2S.o: ParP2S.f90
	$(FC) $(FFLAGS) $(FOPT) ParP2S.f90 -c

ParP2S: ParP2S.o
	$(FC) $(FFLAGS) $(FOPT) ParP2S.o -o ParP2S

clean:
	rm -f *.o
The second makefile looks very similar; I just added the -fopenmp flag:
FC = mpif90
FFLAGS = -Wall -ffree-line-length-none -fopenmp
FOPT = -O3
all: ParP2S.o ParP2S

ParP2S.o: ParP2S.f90
	$(FC) $(FFLAGS) $(FOPT) ParP2S.f90 -c

ParP2S: ParP2S.o
	$(FC) $(FFLAGS) $(FOPT) ParP2S.o -o ParP2S

clean:
	rm -f *.o
The second makefile is for a hybrid (MPI with OpenMP) version of the code. For now, I have exactly the same code compiled with these two different makefiles. In the second case, the code is more than 100 times slower. Any comments on what I am doing wrong?
edit 1: I am not running multi-threaded tasks. In fact, the code does not contain any OpenMP directives; it is the pure MPI code, just compiled with a different makefile. Nevertheless, I did try running after setting the following (see below) and it didn't help.
export MV2_ENABLE_AFFINITY=0
export OMP_NUM_THREADS=1
export OMP_PROC_BIND=true
mpirun -np 2 ./ParP2S
edit 2: I am using gcc version 4.9.2 (I know there was a bug with vectorization with -fopenmp in an older version). I thought the inclusion of the -fopenmp flag might be inhibiting compiler optimizations; however, after reading the interesting discussion (May compiler optimizations be inhibited by multi-threading?), I am not sure if this is the case. Furthermore, as my code does not have any OpenMP directives, I don't see why the code compiled with -fopenmp should be that much slower.
edit 3: When I run without -fopenmp (first makefile), the code takes about 0.2 seconds without optimizations (-O0) and 0.08 seconds with optimizations (-O3), but with the -fopenmp flag it takes about 11 seconds at either -O3 or -O0.