
I am trying to parallelize the following nested DO loop structure (the first code below) using the 'collapse' clause in OpenACC. The outermost loop variable 'nbl' appears in the bounds of the inner DO loops (NI(nbl), NJ(nbl), NK(nbl)), so there is a dependency; helpfully, the compiler reports this as an error up front. So I had to compromise and apply the 'collapse' clause only to the four innermost loops. Is there a way to parallelize this loop nest for maximum performance by also exploiting the parallelism of "nbl = 1,nblocks"?

Compiler: pgfortran Flags: -acc -fast -ta=tesla:managed -Minfo=accel

Code that gives an error due to the data dependency between the outermost DO loop and the inner DO loops:

!$acc parallel loop collapse(5)
DO nbl = 1,nblocks
  DO n_prim = 1,nprims
    DO k = 1, NK(nbl)
      DO j = 1, NJ(nbl)
        DO i = 1, NI(nbl)

          Px(i,j,k,nbl,n_prim) = i*j + Cx(i,j,k,nbl,1)*Cx(i,j,k,nbl,5) + Cx(i,j,k,nbl,2)

        ENDDO
      ENDDO
    ENDDO
  ENDDO
ENDDO
!$acc end parallel loop

Compromised working code with less parallelism:

DO nbl = 1,nblocks
  !$acc parallel loop collapse(4)
  DO n_prim = 1,nprims
    DO k = 1, NK(nbl)
      DO j = 1, NJ(nbl)
        DO i = 1, NI(nbl)

          Px(i,j,k,nbl,n_prim) = i*j + Cx(i,j,k,nbl,1)*Cx(i,j,k,nbl,5) + Cx(i,j,k,nbl,2)

        ENDDO
      ENDDO
    ENDDO
  ENDDO
  !$acc end parallel loop
ENDDO

Thanks!

1 Answer


The dependency comes from the array look-ups (NI(nbl), NJ(nbl), NK(nbl)) used for the upper bounds of the inner loops. In order to collapse loops, the total iteration count must be known before entering the loop nest, but here the inner bounds vary with nbl.

Try something like the following and split the parallelism into two levels:

!$acc parallel loop collapse(2)
DO nbl = 1,nblocks
  DO n_prim = 1,nprims
    !$acc loop collapse(3)
    DO k = 1, NK(nbl)
      DO j = 1, NJ(nbl)
        DO i = 1, NI(nbl)
          Px(i,j,k,nbl,n_prim) = i*j + Cx(i,j,k,nbl,1)*Cx(i,j,k,nbl,5) + Cx(i,j,k,nbl,2)
        ENDDO
      ENDDO
    ENDDO
  ENDDO
ENDDO
!$acc end parallel loop
– Mat Colgrove
  • I would also add the `gang` clause on the outer loop (the one with `collapse(2)`) and the `vector` clause on the inner loop (`collapse(3)`) to make use of the different parallelism levels that OpenACC provides. – wyphan Sep 15 '21 at 17:50
  • 1
    While I can't tell for sure without looking at the compiler feedback messages (-Minfo=accel), it's likely the compiler is implicitly scheduling the outer loops as 'gang' and inner as 'vector'. I personally consider the loop schedule clauses as tuning only options so try to avoid using them (unless I'm tuning for particular architecture and need a different schedule than the default). This aids with performance portability if a different architecture prefers a different schedule , But they certainly are valid. – Mat Colgrove Sep 15 '21 at 18:15
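For reference, a minimal sketch of the explicit gang/vector scheduling mentioned in the comments might look like the following; the clause placement is only one possible tuning choice (not necessarily better than the compiler's default schedule), and the loop body and array names are taken from the question:

!$acc parallel loop gang collapse(2)
DO nbl = 1,nblocks
  DO n_prim = 1,nprims
    ! Inner three loops mapped to vector-level parallelism
    !$acc loop vector collapse(3)
    DO k = 1, NK(nbl)
      DO j = 1, NJ(nbl)
        DO i = 1, NI(nbl)
          Px(i,j,k,nbl,n_prim) = i*j + Cx(i,j,k,nbl,1)*Cx(i,j,k,nbl,5) + Cx(i,j,k,nbl,2)
        ENDDO
      ENDDO
    ENDDO
  ENDDO
ENDDO
!$acc end parallel loop

As Mat notes in the comment above, the compiler's implicit schedule is often already gang over the outer loops and vector over the inner ones, so treat the explicit clauses as a tuning option rather than a requirement.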