How to vectorize these do loops?

Question

I am dealing with a 4-fold loop. It is very slow because of the intrinsic function exp() in the inner loop. Here is a small example:

program main
  implicit none
  integer :: n, m, k, w
  real*8 :: a(100), b(100), c(1000), d(4)
  real*8 :: self(4,1000)

  a = 1.0d0
  b = 1.0d0
  c = 1.0d0
  d = 1.0d0

  self=0.0d0
  do n = 1, 100
    do m = 1, 100
      do k = 1, 1000
        do w = 1, 4
          self(w,k) = self(w,k) + exp( ( (c(k)-a(n))**2 + (c(k)-b(m))**2 ) / (2.0d0*d(w)**2) )
        enddo
      enddo
    enddo
  enddo

  ! my optimization:
  self=0.0d0
  do n = 1, 100
    do m = 1, 100
      !do k = 1, 1000
        do w = 1, 4
          self(w,:) = self(w,:) + exp( ( (c(:)-a(n))**2 + (c(:)-b(m))**2 ) / (2.0d0*d(w)**2) )
        enddo
      !enddo
    enddo
  enddo

end program

It looks like the fortran intrinsic function exp() is not efficient. Yet, I do not want to rewrite the exp() to be some equivalent expressions.

The problem is not that exp is inefficient, executing 100x100x1000x4 exp functions is just heavy — kvantour, May 02 '20 at 09:35
Yes, this is just a demo. The actual loop size is 1million * 500 * 500 * 4 * 3, the outmost 1 million loop is parallelized — Memories, May 02 '20 at 09:39
your exponent contains 2 parts which only contain 3 indices, you can precompute them and then only perform a final multiplication. This will already do some speed up. (precompute all `exp((c(k)-a(n)**2/(2*w(k)))` and all `exp((c(k)-b(m)**2/(2*w(k)))`) that leads to 2x100x1000x4 exp computations instead 2x100x100x1000x4 — kvantour, May 02 '20 at 09:50
And for the bigger calculation if you can't afford the memory break it into blocks where you precompute 1 block, do all the work with that block, and move onto the next — Ian Bush, May 02 '20 at 09:57
Also please don't use the non-standard, non-portable real*8 which is not and never has been part of Fortran. Please see https://stackoverflow.com/questions/838310/fortran-90-kind-parameter — Ian Bush, May 02 '20 at 10:33
Remarkably similar to https://stackoverflow.com/questions/61174309/how-to-optimize-this-fortran-subroutine-with-many-loops/61175101. The principal lesson in the answers to that question apply here too - don't bury computations too deeply inside loop nests and repeat them needlessly. — High Performance Mark, May 02 '20 at 10:56
@kvantour, No, you can't do this. There is no such array called `w` of size 1000 and all four indices are used inside the `exp` function, there is `k`, `m`, `n`, and `w`. — AboAmmar, May 03 '20 at 02:16
@AboAmmar That is a typo and copy paste from the typo. You could also have said that I missed a bracket.: `exp( (c(k)-b(m))**2/(2*d(w)) )` they are of this form and I hope I did not make a mistake now. — kvantour, May 03 '20 at 13:58

score 0 · Answer 1 · answered May 03 '20 at 13:03

I posted my answer here, based on some nice comments above:

program main
  implicit none
  integer :: n, m, w
  real*8 :: a(100), b(100), c(1000), d(4)
  real*8 :: self(1000,4), exp_tmp(1000,4,100)

  a = 1.0d0
  b = 1.0d0
  c = 1.0d0
  d = 1.0d0

  do n = 1, 100
    do w = 1, 4
      exp_tmp(:,w,n) = exp( 0.5d0 * ( (c(:)-a(n))/d(w) )**2 )
    enddo
  enddo

  self=0.0d0
  do n = 1, 100
    do m = 1, 100
      do w = 1, 4
        self(:,w) = self(:,w) + exp_tmp(:,w,n) * exp_tmp(:,w,m)
      enddo
    enddo
  enddo

end program

I do not know whether the loops could be vectorized to be compact.

be aware that this is not the same. You are missing an `exp_tmp2` which handels array `b` — kvantour, May 03 '20 at 13:59

How to vectorize these do loops?

1 Answers1