0

I parallelize a Fortran code with OpenMP. I'm still debugging (there may be some remaining bugs but sequential version is OK) and the behaviour of the OpenMP threads is strange, there are threads that do not wait the others at OMP BARRIER or OMP END DO statements.

For example, in one (simplified) subroutine one has :

SUBROUTINE calcT (T)

USE mod_param_simul
USE mod_transport

REAL(8), INTENT(INOUT), DIMENSION(0:nr+1,0:nth+1) :: T

INTEGER :: i, j
REAL(8), DIMENSION(:,:), SAVE, ALLOCATABLE :: Work

!$OMP SINGLE
ALLOCATE (Work(nr,nth) )
!$OMP END SINGLE

!$OMP DO COLLAPSE(2)
DO j = 1, nth
   DO i = 1, nr
      Work(i,j) = aTdiffW(i,j)*(T(i-1,j  )-T(i,j) ) + aTdiffE(i,j)*(T(i+1,j  )-T(i,j) ) &
                + aTdiffN(i,j)*(T(i  ,j+1)-T(i,j) ) + aTdiffS(i,j)*(T(i  ,j-1)-T(i,j) ) &
                + bsrc_Tdiff(i,j)
   END DO
END DO
!$OMP END DO

!$OMP DO COLLAPSE(2)
DO j = 1, nth
   DO i = 1, nr
      T(i,j) = T(i,j) + Work(i,j)
   END DO
END DO
!$OMP END DO

!$OMP SINGLE
DEALLOCATE (Work)
!$OMP END SINGLE

RETURN
END SUBROUTINE calcT

I already have checked a lot of things :

  • OpenMP visibility of all variables in the application
  • SAVE attribute for variables, especially arrays, defined in Fortran modules
  • bounds of arrays, of loop indexes
  • dynamic allocation of arrays
  • use valgrind -- memcheck to monitor the code ; nothing relevant found
  • use recent releases of several Fortran compilers (Intel ifort, GNU gfortran, PGI pgf90) with full debugging options. I get crash of the code in the second loop nest (write/flush statements inserted before/after), updating array T with error messages like

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

or

forrtl: severe (408): fort: (8): Attempt to fetch from allocatable variable WORK when it is not allocated

Edit : I add more informations

Intel ifort  2021.3.0 20210609 ; options : -fpp  -stand f08 -e08 -free -g -O0 -init=huge -init=snan -init=arrays -ftrapuv -debug all -check all -warn general -implicitnone -no-ftz -fpe0 -fp-model precise     -traceback -fno-alias -fno-fnalias -O0 -qopenmp

Runtime error :

forrtl: severe (408): fort: (8): Attempt to fetch from allocatable variable WORK when it is not allocated

Image              PC                Routine            Line        Source             
a.out       000000000074806F  Unknown               Unknown  Unknown
a.out       000000000043EC20  mod_energy_mp_calct         529  energy.f08

and line 529 is just before line 531 which is T(i,j) = T(i,j) + Work(i,j)

GNU Fortran (GCC) 10.3.0 options : -x f95-cpp-input -std=f2008 -pedantic -ffree-form -g -fbacktrace -ffree-line-length-none -fcheck=all -finit-real=snan -ffpe-trap=invalid,zero,overflow -fimplicit    -none -Wimplicit-interface -Wconversion -Wunderflow -Wall -Wextra -fbounds-check -O0 -fopenmp

Runtime error :

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x7f0741fb53ff in ???
#1  0x40e667 in __mod_energy_MOD_calct
    at /home/........../energy.f08:531
#2  0x46dae7 in MAIN__._omp_fn.0
    at /home/........../main.f08:290
#3  0x7f07429da091 in GOMP_parallel
    at ../../../gcc-10.3.0/libgomp/parallel.c:171
#4  0x4177e6 in MAIN__
    at /home/........../main.f08:81
#5  0x46d653 in main
    at /home/........../main.f08:16

Moreover, Nr and Nth are declared as INTEGER with SAVE attribute in a module that is used in this routine. The dimensions of T, (0:Nr+1,0:Nth+1), are for the inner domain and the boundary cells. T is updated in its inner domain thanks to array Work that is computed with all values of T, boundary conditions included.

The result is always the same. It can occur in this routine or in others with very similar constructions. I do not see what is wrong with respect to the OpenMP norm.

So, I'd be grateful if someone can give me some advices because I ran out of idea.

Regards,

M_Guy
  • 1
  • 2
  • 1
    Please create a full compilable and testable program that we can test ourselves ([mcve]). – Vladimir F Героям слава Nov 18 '21 at 14:52
  • 1
    Why do you think the problem is threads not respecting the synchronisations points? You have given no evidence that this is the case in the above. – Ian Bush Nov 18 '21 at 15:21
  • Several questions : why do you allocated work with $omp singe ? Where is located the $omp parallel instruction ? In the procedure calling calcT ? – Francois Jacq Nov 18 '21 at 20:42
  • @Vladimir F, I tried this but the example behaves correctly, so I suspect something wrong in my code but I don't know how to trap it. I'm not allowed to post/share the whole code. – M_Guy Nov 19 '21 at 08:42
  • @Ian Bush, the backtrace (routine names and line numbers) given by debugging options and the second error message in the same crash, make me think that one thread goes to the deallocate instruction too early. – M_Guy Nov 19 '21 at 08:43
  • 1
    @FrancoisJacq These are presumably orphaned directives. The single is required because the Save makes `work` shared (but I want to find time to absolutely check in the standard about this specific case). But a minimal, reproducible example would make working out what is going wrong much easier. – Ian Bush Nov 19 '21 at 08:45
  • @M_Guy BTW note `real( 8 )` is not portable, not guaranteed to be supported by your compiler, and may not do what you want. See https://stackoverflow.com/questions/838310/fortran-90-kind-parameter – Ian Bush Nov 19 '21 at 08:46
  • @M_Guy Please edit the question to show what the debugging options are telling you – Ian Bush Nov 19 '21 at 08:47
  • @Francois Jacq. When I got the code, Work was an automatic array, REAL(8), DIMENSION(nr,nth) :: Work that can't be directly shared in an OpenMP way. So I changed it as an allocatable array like this ; ``` REAL(8), DIMENSION(:,:), SAVE, ALLOCATABLE :: Work ``` Here we are in the dynamic extent ; the $omp parallel instruction is located in the main, which calls a routine and that one calls calcT. – M_Guy Nov 19 '21 at 08:57
  • @Ian Bush, I simplify too much this example, I already use REAL(KIND=REAL64) but thanks for the link. – M_Guy Nov 19 '21 at 12:29
  • Nr and Nth do not appear known? And the T array is from 0:nr+1, which seems a bit odd. – Holmz Nov 20 '21 at 13:43

0 Answers0