Does the OpenMP 4.5/5.0 standard support Fortran automatic arrays on GPU?

Question

I have tried compiling the following code with nvhpc/21.3 to run on an NVIDIA V100, but the code bombs out at run time. So nvhpc does not appear to support Fortran automatic arrays in device code, but does the OpenMP standard support them?

        module test_mod
        contains
          subroutine saxpy(i,n,s)
          integer                 :: i,n
          real(8), parameter      :: p=0.5
          real(8), dimension(n,n) :: a,b,c ! automatic arrays
          real(8)                 :: s
!$omp     declare target
          a(i,:) = 1.0d0
          b(i,:) = 2.0d0
          c(i,:) = 0.0d0
          do j=1,n
                c(i,j) = a(i,j) + p*b(i,j)
          end do
          s=c(i,1)
          end subroutine
        end module
 
        program test
        use test_mod
        integer, parameter :: n=100
        integer            :: i
        real(8)            :: s
!$omp   target teams distribute parallel do map(from:s)
        do i=1,n
                call saxpy(i,n,s)
        end do
        print*,'%test_omp, ',s
        end program

How did you conclude that the compiler does not support them? What exactly happened? What was the error message? — Vladimir F Героям слава, Jun 04 '21 at 20:45
If by "nvhpc" you mean the nvidia Fortran compiler, then are you sure that that compiler supports OpenMP 4.5 GPU offload at all? What reason do you have to worry that OpenMP itself does not support declared targets as automatic arrays? — francescalus, Jun 04 '21 at 22:00
https://docs.nvidia.com/hpc-sdk/compilers/hpc-compilers-user-guide/index.html#openmp-subset might be of interest, purporting to detail "the subset of OpenMP 5.0 features that the HPC compilers support". Unfortunately the situation is a mess; I can't quickly find a definitive statement of what, if any, standard of OpenMP is supported in full. To say more, please explain what you mean by "the code bombs out", as Vladimir asks. — Ian Bush, Jun 05 '21 at 04:20
Also please understand that Real(8) is poor practice, not portable, may not do what you think, and might not be supported by all compilers - you may not use the nvhpc suite for ever. See https://stackoverflow.com/questions/838310/fortran-90-kind-parameter — Ian Bush, Jun 05 '21 at 04:25
We know that the compiler does not support them because the Nvidia compiler team we work with said that they don't. — rosenbe2, Jun 06 '21 at 18:36
./test %test_omp, running on GPU FATAL ERROR: FORTRAN AUTO ALLOCATION FAILED FATAL ERROR: FORTRAN AUTO ALLOCATION FAILED FATAL ERROR: FORTRAN AUTO ALLOCATION FAILED Fatal error: expression 'HX_CU_CALL_CHECK(p_cuStreamSynchronize(stream[dev]))' (value 1) is not equal to expression 'HX_SUCCESS' (value 0) — rosenbe2, Jun 06 '21 at 18:37

Jim Cownie · Answer 1 · 2021-06-08T07:48:31.507

score 1

"does the OpenMP standard support fortran automatic arrays?"

The OpenMP standard assumes underlying serial language standards, but does not require them.

The OpenMP standard itself has nothing to say about features like this which are unrelated to the changes in program behaviour specified by OpenMP. (Just as OpenMP says nothing about which format specifiers are legal :-)).

(More generally, the OpenMP ARB has no tests or validation suites for OpenMP compliance, so perverse vendors could claim to support a particular level of the OpenMP standard while not really doing so... Caveat emptor!)

edited Jun 08 '21 at 07:48

answered Jun 07 '21 at 08:19

Well, that is not completely true. I remember that several features not working with OpenMP in existing compilers were explained by them not being supported (the exact behavior defined) in OpenMP 3. Associate and similar. And once GPUs come into play, everything gets muddier. The OpenMP 4 or 5 standard could easily forbid certain features in device code. The question is whether they do. – Vladimir F Героям слава Jun 07 '21 at 10:35
So your question is not "Does OpenMP support them?", but "Does OpenMP forbid them?" that's rather a different question, and easier to answer, since if the relevant OpenMP standard says that it handles the appropriate level of the underlying language standard and does not explicitly forbid a feature then it should be supported. – Jim Cownie Jun 08 '21 at 08:27
@JimCownie "My question" in which exact sense? If you mean the original question at the top, then no, I did not ask that question. If you mean some new question I hypothetically asked in a comment, then I was pointing to my interpretation of where a potential problem could lie rather than posing an actual specific question to you. – Vladimir F Героям слава Aug 30 '22 at 07:47
@VladimirFГероямслава I am just echoing what you said in your previous comment, ("The OpenMP 4 or 5 standard could easily forbid certain features in device code. The question is whether they do") and pointing out that that is different from the question you asked in the headline... – Jim Cownie Aug 30 '22 at 07:52
It starts with "Does..." and ends in a question mark. That looks like a question to me. – Jim Cownie Aug 31 '22 at 08:28
I say again for the third time, **I** did not ask the question. **rosenbe2** did ask it. – Vladimir F Героям слава Aug 31 '22 at 08:55
Ah, OK, sorry, got lost in the multiple threads here. – Jim Cownie Sep 01 '22 at 09:06

score 0 · Answer 2 · edited Aug 30 '22 at 10:35

Nvfortran does support automatics in device code. The problem here is that the code is overflowing the device heap. The default heap size varies from device to device, but can be quite small. You can increase it via a call to cudaDeviceSetLimit (with the cudaLimitMallocHeapSize limit) or the environment variable NV_ACC_CUDA_HEAPSIZE.

% nvfortran -mp=gpu -fast test.f90 -V21.3 ; a.out
FATAL ERROR: FORTRAN AUTO ALLOCATION FAILED
FATAL ERROR: FORTRAN AUTO ALLOCATION FAILED
FATAL ERROR: FORTRAN AUTO ALLOCATION FAILED
Fatal error: expression 'HX_CU_CALL_CHECK(p_cuStreamSynchronize(stream[dev]))' (value 1) is not equal to expression 'HX_SUCCESS' (value 0)
Abort
% setenv NV_ACC_CUDA_HEAPSIZE 64MB
% a.out
 %test_omp,     2.000000000000000
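
For the API route, the CUDA runtime call that controls this limit is `cudaDeviceSetLimit` with `cudaLimitMallocHeapSize`. The following is only a minimal sketch, assuming the CUDA Fortran `cudafor` module is available (nvfortran with `-cuda` added to `-mp=gpu`) and that the OpenMP runtime shares the device context in which the limit is set. The program name and the 64 MB value are illustrative, chosen to match the environment variable setting above.

```fortran
! Sketch: raise the device malloc heap before any offloaded region runs.
! Assumed build line (hypothetical file name): nvfortran -mp=gpu -cuda heap.f90
program set_heap_limit
  use cudafor   ! provides cudaDeviceSetLimit, cudaLimitMallocHeapSize, cuda_count_kind
  implicit none
  integer(kind=cuda_count_kind) :: heap_bytes
  integer :: istat

  ! 64 MB, the same size used with NV_ACC_CUDA_HEAPSIZE above
  heap_bytes = 64_cuda_count_kind * 1024 * 1024

  ! Must be called before the first kernel that allocates from the device heap
  istat = cudaDeviceSetLimit(cudaLimitMallocHeapSize, heap_bytes)
  if (istat /= cudaSuccess) then
    print *, 'cudaDeviceSetLimit failed: ', trim(cudaGetErrorString(istat))
    stop 1
  end if

  ! ... offloaded regions that call routines with automatic arrays go here ...
end program set_heap_limit
```

The environment variable is the less intrusive option, since it needs no source change and no extra compiler flag.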

However, in general it's not recommended to use automatics in device code. Besides the heap size limitation, allocation is serialized, which can hurt performance. It is better to use private arrays and pass them into the device subroutine.
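
One way the private-array approach might look for the code in the question is sketched below. This is a rewrite under assumptions, not the original routine. Because each call only touches row `i`, the n-by-n automatics are replaced by length-`n` scratch arrays declared in the caller and listed in a `private` clause. The names `saxpy_mod`, `saxpy_row`, and `test_private` are made up for illustration, and `s` becomes an array so that iterations do not all write to the same scalar.

```fortran
module saxpy_mod
  use iso_fortran_env, only: real64
  implicit none
contains
  ! Scratch arrays arrive as explicit-shape dummy arguments, so the
  ! routine itself performs no allocation on the device.
  subroutine saxpy_row(n, a, b, c, s)
    integer, intent(in)         :: n
    real(real64), intent(inout) :: a(n), b(n), c(n)
    real(real64), intent(out)   :: s
    real(real64), parameter     :: p = 0.5_real64
    integer :: j
    !$omp declare target
    a = 1.0_real64
    b = 2.0_real64
    do j = 1, n
      c(j) = a(j) + p*b(j)
    end do
    s = c(1)
  end subroutine saxpy_row
end module saxpy_mod

program test_private
  use iso_fortran_env, only: real64
  use saxpy_mod
  implicit none
  integer, parameter :: n = 100
  integer            :: i
  real(real64)       :: a(n), b(n), c(n)  ! fixed-size scratch, one private copy per thread
  real(real64)       :: s(n)              ! one result per iteration avoids a race on s

  !$omp target teams distribute parallel do private(a, b, c) map(from: s)
  do i = 1, n
    call saxpy_row(n, a, b, c, s(i))
  end do

  print *, 'test_private, ', s(1)
end program test_private
```

Every `s(i)` should come out as 1.0 + 0.5*2.0 = 2.0. In a larger code the scratch size would normally be a known upper bound rather than a compile-time constant; whether the compiler places the private copies in fast memory is implementation dependent.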

I am dealing with a rather large Fortran 2008 code that uses automatic arrays throughout as well as type bound procedures. We have been struggling to port it to both OpenACC and OpenMP offloading. I will check out your suggestion. Thanks! — rosenbe2, Aug 30 '22 at 21:47
