I am dealing with a four-level nested loop. It is very slow, apparently because of the intrinsic function exp() in the innermost loop. Here is a small example:
program main
  implicit none
  integer :: n, m, k, w
  real*8 :: a(100), b(100), c(1000), d(4)
  real*8 :: self(4,1000)
  a = 1.0d0
  b = 1.0d0
  c = 1.0d0
  d = 1.0d0
  self = 0.0d0
  do n = 1, 100
    do m = 1, 100
      do k = 1, 1000
        do w = 1, 4
          self(w,k) = self(w,k) + exp( ( (c(k)-a(n))**2 + (c(k)-b(m))**2 ) / (2.0d0*d(w)**2) )
        enddo
      enddo
    enddo
  enddo
  ! my optimization:
  self = 0.0d0
  do n = 1, 100
    do m = 1, 100
      !do k = 1, 1000
      do w = 1, 4
        self(w,:) = self(w,:) + exp( ( (c(:)-a(n))**2 + (c(:)-b(m))**2 ) / (2.0d0*d(w)**2) )
      enddo
      !enddo
    enddo
  enddo
end program
It looks like the Fortran intrinsic function exp() is not efficient. However, I do not want to replace exp() with some equivalent hand-written expression.
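One direction that keeps exp() itself but calls it far less often might be to use exp(x+y) = exp(x)*exp(y): the double sum over n and m then separates into a product of two single sums, so exp() is evaluated about (100+100)*1000*4 times instead of 100*100*1000*4 times. The following is only a minimal sketch under assumptions: the program name, the variables ref, fast, s, sum_a and sum_b, and the random test data are hypothetical and not part of my real code, and the inputs are assumed small enough that none of the exponentials overflow.

```fortran
program factored_exp
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: na = 100, nb = 100, nc = 1000, nd = 4
  real(dp) :: a(na), b(nb), c(nc), d(nd)
  real(dp) :: ref(nd,nc), fast(nd,nc)
  real(dp) :: s, sum_a, sum_b
  integer :: n, m, k, w

  ! Hypothetical test data (assumption): values in [0,1) keep the
  ! exponents small, so exp() cannot overflow here.
  call random_number(a)
  call random_number(b)
  call random_number(c)
  d = [1.0_dp, 1.5_dp, 2.0_dp, 2.5_dp]

  ! Reference: the original four-level loop, one exp() per (n,m,k,w).
  ref = 0.0_dp
  do n = 1, na
    do m = 1, nb
      do k = 1, nc
        do w = 1, nd
          ref(w,k) = ref(w,k) + exp(((c(k)-a(n))**2 + (c(k)-b(m))**2) / (2.0_dp*d(w)**2))
        end do
      end do
    end do
  end do

  ! Factored form:
  !   sum_n sum_m exp((A_n + B_m)/s) = (sum_n exp(A_n/s)) * (sum_m exp(B_m/s))
  ! exp() is applied elementally to whole arrays, then reduced with sum().
  do k = 1, nc
    do w = 1, nd
      s = 2.0_dp*d(w)**2
      sum_a = sum(exp((c(k) - a)**2 / s))
      sum_b = sum(exp((c(k) - b)**2 / s))
      fast(w,k) = sum_a*sum_b
    end do
  end do

  ! Both results should agree up to floating-point rounding.
  print *, 'max relative difference:', maxval(abs(fast - ref)/abs(ref))
end program factored_exp
```

The two results differ only by rounding, because the factored form adds the terms in a different order. Whether this helps in the real code depends on the summand actually having this separable form.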