
I am trying to parallelize the following nested DO loop structure (the first code below) using the 'collapse' clause in OpenACC. The outermost loop variable 'nbl' appears in the bounds of the inner DO loops (NI(nbl), NJ(nbl), NK(nbl)), so there is a dependency; helpfully, the compiler reports this as an error up front. So I had to compromise and apply the 'collapse' clause only to the four innermost loops. Is there a way to parallelize this loop nest for maximum performance by also exploiting the parallelism of "nbl = 1,nblocks"?

Compiler: pgfortran Flags: -acc -fast -ta=tesla:managed -Minfo=accel

Code that gives an error due to the data dependency between the outermost DO loop and the inner DO loops:

!$acc parallel loop collapse(5)
DO nbl = 1,nblocks
  DO n_prim = 1,nprims
    DO k = 1, NK(nbl)
      DO j = 1, NJ(nbl)
        DO i = 1, NI(nbl)

          Px(i,j,k,nbl,n_prim) = i*j + Cx(i,j,k,nbl,1)*Cx(i,j,k,nbl,5) + Cx(i,j,k,nbl,2)

        ENDDO
      ENDDO
    ENDDO
  ENDDO
ENDDO
!$acc end parallel loop

Compromised working code with less parallelism:

DO nbl = 1,nblocks
  !$acc parallel loop collapse(4)
  DO n_prim = 1,nprims
    DO k = 1, NK(nbl)
      DO j = 1, NJ(nbl)
        DO i = 1, NI(nbl)

          Px(i,j,k,nbl,n_prim) = i*j + Cx(i,j,k,nbl,1)*Cx(i,j,k,nbl,5) + Cx(i,j,k,nbl,2)

        ENDDO
      ENDDO
    ENDDO
  ENDDO
  !$acc end parallel loop
ENDDO

Thanks!

1 Answer


The dependency comes from the array look-ups (NI(nbl), NJ(nbl), NK(nbl)) used for the upper bounds of the inner loops. In order to collapse loops, the total iteration count must be known before entering the loop nest, but here the inner bounds vary with nbl.

Try something like the following and split the parallelism into two levels:

!$acc parallel loop collapse(2)
DO nbl = 1,nblocks
  DO n_prim = 1,nprims
    !$acc loop collapse(3)
    DO k = 1, NK(nbl)
      DO j = 1, NJ(nbl)
        DO i = 1, NI(nbl)
          Px(i,j,k,nbl,n_prim) = i*j + Cx(i,j,k,nbl,1)*Cx(i,j,k,nbl,5) + Cx(i,j,k,nbl,2)
        ENDDO
      ENDDO
    ENDDO
  ENDDO
ENDDO
!$acc end parallel loop
– Mat Colgrove
  • I would also add the `gang` clause on the outer loop (the one with `collapse(2)`) and the `vector` clause on the inner loop (`collapse(3)`) to make use of the different parallelism levels that OpenACC provides. – wyphan Sep 15 '21 at 17:50
  • 1
    While I can't tell for sure without looking at the compiler feedback messages (-Minfo=accel), it's likely the compiler is implicitly scheduling the outer loops as 'gang' and inner as 'vector'. I personally consider the loop schedule clauses as tuning only options so try to avoid using them (unless I'm tuning for particular architecture and need a different schedule than the default). This aids with performance portability if a different architecture prefers a different schedule , But they certainly are valid. – Mat Colgrove Sep 15 '21 at 18:15
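For reference, a minimal sketch of the explicit gang/vector scheduling mentioned in the comments might look like the following; the clause placement is only one possible tuning choice (not necessarily better than the compiler's default schedule), and the loop body and array names are taken from the question:

!$acc parallel loop gang collapse(2)
DO nbl = 1,nblocks
  DO n_prim = 1,nprims
    ! Inner three loops mapped to vector-level parallelism
    !$acc loop vector collapse(3)
    DO k = 1, NK(nbl)
      DO j = 1, NJ(nbl)
        DO i = 1, NI(nbl)
          Px(i,j,k,nbl,n_prim) = i*j + Cx(i,j,k,nbl,1)*Cx(i,j,k,nbl,5) + Cx(i,j,k,nbl,2)
        ENDDO
      ENDDO
    ENDDO
  ENDDO
ENDDO
!$acc end parallel loop

As Mat notes in the comment above, the compiler's implicit schedule is often already gang over the outer loops and vector over the inner ones, so treat the explicit clauses as a tuning option rather than a requirement.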