How to optimize this Fortran subroutine with many loops?

Question

I am dealing with subroutine that is very inefficient when the array size becomes large, for example, NN=1000, KK=200, MM = 200. But, I can not come up with ideas to optimize it.

program main

  implicit none

  integer :: NN, KK, MM
  integer, allocatable, dimension(:,:) :: id
  complex*16, allocatable, dimension(:) :: phase
  complex*16 :: phase_base(3)
  real*8, allocatable, dimension(:,:) :: wave_base

  complex*16, allocatable, dimension(:,:) :: wave
  integer :: i, j, k, n

  NN = 1000
  KK = 200
  MM = 200

  allocate(id(MM,3))
  allocate(phase(KK))
  allocate(wave_base(KK, NN*(NN+1)/2 ))
  allocate(wave(NN, NN))

  id(:,:) = 2

  phase_base(:) = (1.0d0,1.0d0)

  wave_base(:,:) = 1.0d0

  phase(:) = (1.0d0,1.0d0)

  call  noise_wave(NN, KK, MM, id, phase, phase_base, wave_base, wave)

  deallocate(id)
  deallocate(phase)
  deallocate(wave_base)
  deallocate(wave)

end program main

subroutine noise_wave(NN, KK, MM, id, phase_1, phase_base, wave_base, wave)
  implicit none

  integer, intent(in) :: NN, KK, MM
  integer, intent(in), dimension(MM, 3) :: id
  complex*16, intent(in) :: phase_1(KK)
  complex*16, intent(in) :: phase_base(3)
  real*8,  intent(in) :: wave_base(KK, NN*(NN+1)/2 )

  complex*16, intent(out) :: wave(NN, NN)

  integer :: i, j, k, p, n
  integer :: x, y, z
  real :: start, finish
  complex*16 :: phase_2, phase_2_conjg

  do p = 1, MM

    x = id(p, 1)
    y = id(p, 2)
    z = id(p, 3)

    phase_2 = (phase_base(1) ** x) * (phase_base(2) ** y) * (phase_base(3) ** z)

    phase_2_conjg = conjg(phase_2)

    n = 0
    do j = 1, NN
      do i = 1, j   ! upper triangle

        n = n + 1

        do k = 1, KK

          wave(i,j) = wave(i,j) + wave_base(k,n) * phase_1(k) * phase_2_conjg

        enddo

        wave(j,i) = conjg(wave(i,j) )

      enddo
    enddo
  enddo

end subroutin

Could someone give me some hint? (I have fulfill the suggested optimizations. Also, following Ian's suggestion, I have added a small test. Thus you can test it directly.)

This would be so much easier to answer if you supplied a minimal reproducible example (see https://stackoverflow.com/help/minimal-reproducible-example) That way we could test ideas rather than just (intelligently) guess — Ian Bush, Apr 12 '20 at 17:14

High Performance Mark · Answer 1 · 2020-04-12T17:08:22.603

3

You might get a measurable speedup if you change your loop nest to

  do p = 1, MM

     x = id(p, 1)
     y = id(p, 2)
     z = id(p, 3)
     phase = (phase_base(1) ** x) * (phase_base(2) ** y) * (phase_base(3) ** z)
     conjg_phase = conjg(phase)  ! new variable, calculate here, use below

     n = 0
     do j = 1, NN
        do i = 1, j   
           n = n + 1
           do k = 1, KK
              wave(i,j) = wave(i,j) + wave_base(k,n) * conjg_phase
           enddo
        enddo
        wave(j,i) = conjg(wave(i,j) )
     enddo
  enddo

(and it might still be correct if I've understood the code !). Even little computations like the ones I've lifted out of the bottom of the loop nest are a drag if repeated often enough. And the execution speed might benefit from moving those values in and out of cache less often too.

It might be worth (a little) swapping the dimensions of id, then reading id(1:3,p) is likely to be more cache-friendly than the current version.

And if the execution speed is still not to your taste, time to learn OpenMP (if you don't know it already).

edited Apr 12 '20 at 17:08

answered Apr 12 '20 at 17:03

High Performance Mark

77,191
7
105
161

1

I don't think you need the p loop to be around the main loop nest - can't you just sum the conjg_phase's and apply the summed value? – Ian Bush Apr 12 '20 at 17:12
Thanks for the nice idea, I have tested your reorganization. It gains massive speedup. @High Performance Mark – Memories Apr 12 '20 at 17:30
1

MY other point is that real*8 and complex*16 are not, and have never been part of standard Fortran and should be avoided - please see https://stackoverflow.com/questions/838310/fortran-90-kind-parameter – Ian Bush Apr 12 '20 at 18:18
@IanBush: as to your first comment - yes. – High Performance Mark Apr 12 '20 at 21:46

Memories · Accepted Answer · 2020-04-13T07:51:27.557

Here are my solution following the nice ideas above. There is still some room for efficiency gain before OpenMP. For example, the first k loop in the subroutine can be eliminated by sum function.

program main

  implicit none

  integer :: NN, KK, MM
  integer, allocatable, dimension(:,:) :: id
  complex*16, allocatable, dimension(:) :: phase
  complex*16 :: phase_base(3)
  real*8, allocatable, dimension(:,:) :: wave_base

  complex*16, allocatable, dimension(:,:) :: wave
  integer :: i, j, k, n

  NN = 1000
  KK = 200
  MM = 200

  allocate(id(MM,3))
  allocate(phase(KK))
  allocate(wave_base(KK, NN*(NN+1)/2 ))
  allocate(wave(NN, NN))

  id(:,:) = 2

  phase_base(:) = (1.0d0,1.0d0)

  wave_base(:,:) = 1.0d0

  phase(:) = (1.0d0,1.0d0)

  call  noise_wave(NN, KK, MM, id, phase, phase_base, wave_base, wave)

  deallocate(id)
  deallocate(phase)
  deallocate(wave_base)
  deallocate(wave)

end program main

subroutine noise_wave(NN, KK, MM, id, phase_1, phase_base, wave_base, wave)
  implicit none

  integer, intent(in) :: NN, KK, MM
  integer, intent(in), dimension(MM, 3) :: id
  complex*16, intent(in) :: phase_1(KK)
  complex*16, intent(in) :: phase_base(3)
  real*8,     intent(in) :: wave_base(KK, NN*(NN+1)/2 )
  complex*16, intent(out):: wave(NN, NN)

  integer :: i, j, k, p, n
  integer :: x, y, z
  real :: start, finish
  complex*16 :: phase_2, phase_2_conjg
  complex*16 :: wave_tmp(NN*(NN+1)/2)
  complex*16 :: wave_tmp_2(NN*(NN+1)/2)

  do k = 1, KK

    wave_tmp(:) = wave_tmp(:) + wave_base(k,:) * phase_1(k)

  enddo

  do p = 1, MM
    phase_2 = product(phase_base(:)**id(p,:) )
    phase_2_conjg = conjg(phase_2)

    wave_tmp2(:) = wave_tmp2(:) + wave_tmp(n) * phase_2_conjg
  enddo

  n = 0
  do j = 1, NN
    do i = 1, j
        n = n + 1
        wave(i,j) = wave_tmp2(n)
        wave(j,i) = conjg(wave_tmp2(n) )
    enddo
  enddo

end subroutine

How to optimize this Fortran subroutine with many loops?

2 Answers2

Linked