
Here I present a simple Fortran code using OpenMP that calculates a summation of arrays multiple times. My computer has 6 cores with 12 threads and 16 GB of memory.

There are two versions of this code. The first version has only one file, test.f90, in which the summation is implemented. The code is as follows:

program main
  implicit none

  integer*8 :: begin, end, rate
  integer i, j, k, ii, jj, kk, cnt
  real*8,allocatable,dimension(:,:,:)::theta, e

  allocate(theta(2000,50,5))
  allocate(e(2000,50,5))

  ! initialize both arrays (reading them uninitialized is undefined
  ! and can even distort timings, e.g. via NaNs or denormals)
  theta = 0.d0
  e = 0.d0

  call system_clock(count_rate=rate)
  call system_clock(count=begin)

  !$omp parallel do
  do cnt = 1, 8
     do i = 1, 1001
        do j = 1, 50
           theta = theta+0.5d0*e
        end do
     end do       
  end do
  !$omp end parallel do

  call system_clock(count=end)
  write(*, *) 'total time cost is : ', (end-begin)*1.d0/rate

  deallocate(theta)
  deallocate(e)

end program main

This version works fine with OpenMP and we can see acceleration.
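As a quick sanity check of where the speed-up in this version comes from, the distribution of the eight outer iterations across threads can be printed with `omp_get_thread_num()` (a minimal sketch, assuming the code is compiled with OpenMP enabled):

```fortran
! Sketch: confirm that the 8 outer iterations are divided among
! threads rather than replicated on every thread.
program check_sched
  use omp_lib
  implicit none
  integer :: cnt

  !$omp parallel do
  do cnt = 1, 8
     write(*,'(a,i0,a,i0)') 'iteration ', cnt, &
          ' on thread ', omp_get_thread_num()
  end do
  !$omp end parallel do
end program check_sched
```

With 12 hardware threads and only 8 iterations, each iteration should land on a different thread.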

The second version is modified so that the summation is implemented in a subroutine. There are two files, test.f90 and sub.f90, which are presented as follows:

! test.f90
program main
  use sub
  implicit none

  integer*8 :: begin, end, rate
  integer i, j, k, ii, jj, kk, cnt

  call system_clock(count_rate=rate)
  call system_clock(count=begin)

  !$omp parallel do
  do cnt = 1, 8
    call summation()
  end do
  !$omp end parallel do

  call system_clock(count=end)
  write(*, *) 'total time cost is : ', (end-begin)*1.d0/rate

end program main

and

! sub.f90
module sub
  implicit none

contains

  subroutine summation()
    implicit none
    real*8,allocatable,dimension(:,:,:)::theta, e
    integer i, j

    allocate(theta(2000,50,5))
    allocate(e(2000,50,5))

    theta = 0.d0
    e = 0.d0

    do i = 1, 101
      do j = 1, 50
        theta = theta+0.5d0*e
      end do
    end do

    deallocate(theta)
    deallocate(e)

  end subroutine summation

end module sub
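For scale, each array allocated inside `summation()` holds 2000×50×5 = 500,000 `real*8` elements, roughly 4 MB, so every `theta = theta+0.5d0*e` sweep moves about 12 MB through memory (two reads plus one write). A minimal sketch to print this footprint:

```fortran
! Back-of-envelope footprint check (sketch): one array is
! 2000*50*5 = 500000 real*8 elements, i.e. 4 MB; one sweep of
! theta = theta + 0.5d0*e streams ~3x that through memory.
program footprint
  implicit none
  real*8, allocatable, dimension(:,:,:) :: theta

  allocate(theta(2000,50,5))
  write(*,'(a,i0,a)') 'one array : ', size(theta)*8, ' bytes'
  ! two reads (theta, e) plus one write (theta) per sweep:
  write(*,'(a,i0,a)') 'per sweep : ', 3*size(theta)*8, ' bytes streamed'
  deallocate(theta)
end program footprint
```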

I also wrote a Makefile as follows:

FC = ifort -O2 -mcmodel=large -qopenmp
LN = ifort -O2 -mcmodel=large -qopenmp

FFLAGS = -c
LFLAGS =

result: sub.o test.o
    $(LN) $(LFLAGS) -o result test.o sub.o

test.o: test.f90
    $(FC) $(FFLAGS) -o test.o test.f90

sub.o: sub.f90
    $(FC) $(FFLAGS) -o sub.o sub.f90

clean:
    rm result *.o*  *.mod *.e*
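For reference, the equivalent manual build with gfortran would look like this (a sketch; `-fopenmp` is GCC's counterpart of ifort's `-qopenmp`, and `-mcmodel=large` can be dropped here since the big arrays are allocatable, i.e. heap-allocated):

```shell
# Sketch: building the two-file version with gfortran instead of ifort
gfortran -O2 -fopenmp -c sub.f90
gfortran -O2 -fopenmp -c test.f90
gfortran -O2 -fopenmp -o result test.o sub.o
```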

(We can use gfortran instead.) However, when I run this version, there is a dramatic slow-down when using OpenMP, and it is even much slower than the single-threaded version (no OpenMP). So, what happened here, and how can I fix it?

JunjieChen
  • This is wrong. Every thread now calls the subroutine independently and every thread does all the work. – Vladimir F Героям слава Nov 24 '17 at 18:56
  • Yes, I just found it. However, in the real code, `theta` and `e` should be treated as private, which means I can never get a speed-up from `OpenMP` because of the memory-bandwidth restriction. Is that correct? – JunjieChen Nov 24 '17 at 18:58
  • OK then. Just try to make the loop inside the subroutine parallel. This is why I already told you that you should show the **real** code. – Vladimir F Героям слава Nov 24 '17 at 19:00
  • I just cannot, because in the inner loop the updates of `theta` and `e` depend on the results of previous steps (they are not independent). – JunjieChen Nov 24 '17 at 19:02
  • You must show the **real code**. Otherwise we will just suggest things that will not help. – Vladimir F Героям слава Nov 24 '17 at 19:50
  • The real code is too long to post here, so I uploaded it to GitHub (https://github.com/AchillesJJ/RL_spin1), in which I use `MKL` for matrix diagonalization and no other external routine is needed. I shall add some comments there as soon as possible so that you can understand my code quickly. Thank you. – JunjieChen Nov 25 '17 at 03:39
  • I am not willing to read a whole long code for you, sorry. Probably others are the same way. If you really want help you must have a readable example code. – Ross Nov 25 '17 at 05:41
  • You can see this earlier question (https://stackoverflow.com/questions/47474946/dramatic-slow-down-when-executing-multiple-processes-at-the-same-time), in which I give a typical, readable example code. If you can solve the problem there, the real one will be solved too. – JunjieChen Nov 26 '17 at 05:46
  • I have tried to get some speed-up, but without success... (use of daxpy() from BLAS gave a slight improvement of 10-20%, but very minor, and no improvement with OpenMP). Because many arrays are indexed via vector indexing (?) and the outer loop (over "ep") seems not independent, it looks pretty hard to do a simple parallelization (except for running independent jobs... which has the problem of the previous post). Hmm... – roygvib Nov 26 '17 at 11:17
  • Yes, so far I have the following conclusions: (1) frequent computation on large arrays in a fast, tight loop may exceed the memory bandwidth or cache of a single CPU (even one with multiple cores); (2) on a workstation with many sockets (4 in mine), there are 4 CPUs with a shared memory space and private caches, so multi-threaded speed-up is possible, at least for the 4 threads that I have tried. – JunjieChen Nov 26 '17 at 14:11
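Following the suggestion in the comments to move the parallelism inside the subroutine, one possible sketch (assuming, unlike the asker's real code, that slices along the third dimension can be updated independently) splits the array assignment across threads, so `theta` and `e` stay shared and each element is touched only once per sweep:

```fortran
! Sketch: parallelise inside summation() instead of around the call.
! Each thread updates its own slice theta(:,:,k); the slices are
! disjoint, so there is no race, and total memory traffic matches
! the serial sweep.
subroutine summation()
  implicit none
  real*8, allocatable, dimension(:,:,:) :: theta, e
  integer :: i, j, k

  allocate(theta(2000,50,5))
  allocate(e(2000,50,5))
  theta = 0.d0
  e = 0.d0

  do i = 1, 101
    do j = 1, 50
      !$omp parallel do
      do k = 1, 5
        theta(:,:,k) = theta(:,:,k) + 0.5d0*e(:,:,k)
      end do
      !$omp end parallel do
    end do
  end do

  deallocate(theta)
  deallocate(e)
end subroutine summation
```

With only five slices and a parallel region opened on every (i, j) step, this is unlikely to scale well as written; hoisting `!$omp parallel` outside the i/j loops (keeping `!$omp do` on the k loop) would cut the region-creation overhead, and the operation remains memory-bandwidth-bound in any case.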

0 Answers