
I know that similar questions have been asked before: Openmp array reductions with Fortran, Reducing on array in OpenMP, and even in the Intel forums (https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/345415), but I would like to know your opinion, because the scalability I get is not the one I expect.

So, I need to fill a really large array of complex numbers, and I would like to parallelize that work with OpenMP. Our first approach is this one:

COMPLEX(KIND=DBL), ALLOCATABLE :: huge_array(:)
COMPLEX(KIND=DBL), ALLOCATABLE :: thread_huge_array(:)
INTEGER :: huge_number, index1, index2, index3, index4, index5, bignumber1, bignumber2, smallnumber, depending_index

ALLOCATE(huge_array(huge_number))

!$OMP PARALLEL FIRSTPRIVATE(thread_huge_array) PRIVATE(depending_index)
      ALLOCATE(thread_huge_array(SIZE(huge_array)))
      thread_huge_array = ZERO
!$OMP DO
      DO index1=1,bignumber1
         ! Some calculations
         DO index2=1,bignumber2
            ! Some calculations
            DO index3=1,6
               DO index4=1,6
                  DO index5=1,smallnumber
                     depending_index = function(index1, index2, index3, index4, index5)
                     ! accumulate the computed contribution (elided here)
                     thread_huge_array(depending_index) = thread_huge_array(depending_index)
                  ENDDO
               ENDDO 
            ENDDO
         ENDDO 
      ENDDO 
!$OMP END DO
! (END DO has an implicit barrier, so no explicit BARRIER is needed here)
!$OMP MASTER
      huge_array = ZERO
!$OMP END MASTER
! MASTER has no implicit barrier, so wait until huge_array has been zeroed
! before any thread starts adding its partial result
!$OMP BARRIER
!$OMP CRITICAL
      huge_array = huge_array + thread_huge_array
!$OMP END CRITICAL
      DEALLOCATE(thread_huge_array)
!$OMP END PARALLEL

With that approach we get good scalability up to 8 cores, reasonable scalability up to 32 cores, and from 40 cores onwards it is slower than with 16 cores (we have a machine with 80 physical cores). Of course, we cannot use the REDUCTION clause because the array is so big that it doesn't fit in the stack (even after increasing ulimit to the maximum allowed on the machine).
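
For reference, the REDUCTION form we would like to use, but cannot, looks roughly like the sketch below (only depending_index needs an explicit PRIVATE clause, since the inner sequential DO indices are automatically private in Fortran). My understanding, which may be implementation dependent, is that the per-thread private copies of huge_array are created on each worker thread's stack, whose size is governed by OMP_STACKSIZE rather than by ulimit, and given the size of huge_array no realistic stack limit would be enough:

!$OMP PARALLEL DO REDUCTION(+:huge_array) PRIVATE(depending_index)
      DO index1=1,bignumber1
         ! Some calculations
         DO index2=1,bignumber2
            ! Some calculations
            DO index3=1,6
               DO index4=1,6
                  DO index5=1,smallnumber
                     depending_index = function(index1, index2, index3, index4, index5)
                     ! accumulate the computed contribution (elided here)
                     huge_array(depending_index) = huge_array(depending_index)
                  ENDDO
               ENDDO
            ENDDO
         ENDDO
      ENDDO
!$OMP END PARALLEL DO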

We have also tried a different approach:

USE omp_lib, ONLY: omp_get_max_threads, omp_get_thread_num

COMPLEX(KIND=DBL), ALLOCATABLE :: huge_array(:)
COMPLEX(KIND=DBL), POINTER :: thread_huge_array(:,:)
INTEGER :: huge_number, index1, index2, index3, index4, index5, bignumber1, bignumber2, smallnumber, depending_index
INTEGER :: num_thread, index_ii

ALLOCATE(huge_array(huge_number))

ALLOCATE(thread_huge_array(SIZE(huge_array),omp_get_max_threads()))
thread_huge_array = ZERO

!$OMP PARALLEL PRIVATE(num_thread, depending_index)

      num_thread = omp_get_thread_num()+1
!$OMP DO
      DO index1=1,bignumber1
         ! Some calculations
         DO index2=1,bignumber2
            ! Some calculations
            DO index3=1,6
               DO index4=1,6
                  DO index5=1,smallnumber
                     depending_index = function(index1, index2, index3, index4, index5)
                     ! accumulate the computed contribution (elided here); index with num_thread,
                     ! the 1-based thread index computed above, not the 0-based omp_get_thread_num()
                     thread_huge_array(depending_index, num_thread) = thread_huge_array(depending_index, num_thread)
                  ENDDO
               ENDDO 
            ENDDO
         ENDDO 
      ENDDO 
!$OMP END DO
!$OMP END PARALLEL

huge_array = ZERO

DO index_ii = 1,omp_get_max_threads()
   huge_array = huge_array + thread_huge_array(:,index_ii)
ENDDO

DEALLOCATE(thread_huge_array)

DEALLOCATE(huge_array)

In this second case we obtain longer times for the method (due to the much bigger memory allocation) and worse relative acceleration.

Can you provide some hints on how to achieve better acceleration? Or is it simply not possible with such huge arrays in OpenMP?

aamorm
  • Please show some actual timings and the hardware (what kind of 80 cores? in what kind of CPU sockets?). How many NUMA nodes? Note that the code is not compilable, so we can't do any tests ourselves. – Vladimir F Героям слава May 25 '19 at 12:28
  • Note the first form above has a bug in it: master is not a worksharing construct, and so does not have an implicit barrier at the end. On the other hand, do is one, so you don't need the barrier after it, as the end of the construct provides one automatically. – Ian Bush May 25 '19 at 17:34
  • Also, in the second case, why do you not parallelise the final addition? This is trivially parallel over the first index of thread_huge_array. – Ian Bush May 25 '19 at 17:35
  • Vladimir, I am using a machine with 80 CPUs, 20 cores per socket and 4 NUMA nodes. The timings for a real case with different thread counts are: 1 thread: 132.90; 8 threads: 17.58 (0.42) | 19.22 (0.37); 20 threads: 8.33 (0.951) | 9.522 (0.785); 40 threads: 5.89 (1.802) | 7.79 (1.58); 80 threads: 6.748 (3.597) | 10.188 (3.27). The first result is with the first code, the second result is with the serial reduction. You see that the time spent in the reduction is so big that the acceleration gets worse. – aamorm May 27 '19 at 08:28
  • Ian, thanks so much for noticing. I have edited my code (before, the timing was good but the routine was not thread-safe... my bad). Also, I cannot parallelise the final addition because the array is too large (the size for a real case may be 117273600 positions) and it doesn't fit in the stack. – aamorm May 27 '19 at 08:36
  • In the second version, rather than use array syntax, just write a loop over the elements of huge_array and add in the thread versions. Each element of huge_array in this loop is independent, so it can be parallelised. Further, you don't have to loop over huge_array twice, once to zero it and once to add in; you can do it all in this loop (see the sketch after these comments). – Ian Bush May 27 '19 at 09:10
  • Indeed, I hadn't noticed that about the initialization, and it's a really good idea, thanks a lot for that! I could parallelize it trivially with MPI, but with OpenMP I don't see how, since I think you need either to store the whole array for each thread (declaring huge_array as threadprivate), or to launch threads for each element of the array using an intermediate variable (and that kills the performance). – aamorm May 27 '19 at 11:41
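
A minimal sketch of what Ian Bush suggests in the last two comments, for the second version: zero and reduce huge_array in a single parallelised pass over its elements, so no per-thread copy of the whole array is held on any stack. The names nthreads and index_jj are illustrative and do not appear in the original code:

      nthreads = omp_get_max_threads()
!$OMP PARALLEL DO PRIVATE(index_ii)
      DO index_jj = 1, SIZE(huge_array)
         huge_array(index_jj) = ZERO
         DO index_ii = 1, nthreads
            huge_array(index_jj) = huge_array(index_jj) + thread_huge_array(index_jj, index_ii)
         ENDDO
      ENDDO
!$OMP END PARALLEL DO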

0 Answers