I have a Fortran code that I'm compiling with f2py to run in python. I wrote this code to be a faster approach to a python code that already existed, but it's actually running slower than its python counterpart, which makes me think that it isn't optimized. (This is probably related to this question although the culprit in that case was something that I'm not using here, so it doesn't apply to me.)
The code gets a 4D matrix as input performs a 2D correlation using the dimensions 3 and 4 (as if they were x and y).
The code is this one:
SUBROUTINE correlate4D(Vars, nxc, nyc, nv, nt, vCorr)
! Correlates 4D array assuming that x, y are dims 3 and 4
! Returns a 4D array with shape nv, nt, 2*nxc+1, 2*nyc+1
IMPLICIT NONE
real(kind=8), intent(in), dimension(:,:,:,:) :: Vars
real(kind=8) :: dummysize
integer, intent(in) :: nxc, nyc
integer :: ii, i, iix, iiy, iv, it, dims(4), nv, nt, nx, ny
integer, dimension(2*nxc+1) :: xdel
integer, dimension(2*nyc+1) :: ydel
real(kind=8), intent(out) :: vCorr(nv, nt, 2*nxc+1, 2*nyc+1)
real(kind=8), dimension(:,:,:,:), allocatable :: rolled, prerolled
real(kind=8), dimension(:,:), allocatable :: Mean
real(kind=8), dimension(:,:,:), allocatable :: Mean3d
dims = shape(Vars)
nx=dims(3)
ny=dims(4)
dummysize=nx*ny
allocate(rolled(nv, nt, nx, ny))
allocate(prerolled(nv, nt, nx, ny))
allocate(Mean3d(nv, nt, nx))
allocate(Mean(nv, nt))
Mean3d = sum(Vars, dim=4)/size(Vars, dim=4)
Mean = sum(Mean3d, dim=3)/size(Mean3d, dim=3)
! These replace np.arange()
ii=1
do i=-nxc,+nxc
xdel(ii)=i
ii=ii+1
enddo
ii=1
do i=-nyc,+nyc
ydel(ii)=i
ii=ii+1
enddo
! Calculate the correlation
do iiy=1,size(ydel)
print*,'fortran loop:',iiy,' of', size(ydel)
! cshift replaces np.roll()
prerolled = cshift(Vars, ydel(iiy), dim=4)
do iix=1,size(xdel)
rolled = cshift(prerolled, xdel(iix), dim=3)
forall (it=1:nt)
forall (iv=1:nv)
vCorr(iv,it,iix,iiy) = (sum(Vars(iv,it,:,:) * rolled(iv,it,:,:))/dummysize) / (Mean(iv,it)**2)
endforall
endforall
enddo
enddo
END SUBROUTINE
Running this with a matrix of the size (3, 50, 100, 100)
takes 251 seconds with this code compiled with f2py
and takes only 103 seconds with pure python/numpy code. By the way, this is somewhat the average size of matrix this takes as input should be something like (3, 300, 100, 100)
, but not much larger than that.
Can anyone point out to me ways in which this code can be optimized?
EDIT
I'm compiling with f2py3.4 -c mwe.f90 -m mwe
, and then it can be called with
In [1]: import mwe
In [2]: import numpy as np
In [3]: a=np.random.randn(3,15,100,100)
In [4]: mwe.correlate4d(a, 50, 50, 3, 15)
EDIT2
After reading the comments, I was able to improve it by changing the order of the indices. Now it is about 10% faster than Python, but it's still too slow. I'm sure this can be done faster.
SUBROUTINE correlate4D2(Vars, nxc, nyc, nt, nv, vCorr)
! Correlates 4D array assuming that x, y are dims 1 and 2
! Returns a 4D array with shape 2*nxc+1, 2*nyc+1, nt, nv
IMPLICIT NONE
INTEGER, PARAMETER :: dp = SELECTED_REAL_KIND (13)
real(kind=8), intent(in), dimension(:,:,:,:) :: Vars
real(kind=8) :: dummysize
integer, intent(in) :: nxc, nyc
integer :: ii, i, iix, iiy, iv, it, dims(4), nv, nt, nx, ny
integer, dimension(2*nxc+1) :: xdel
integer, dimension(2*nyc+1) :: ydel
!real(kind=8), intent(out) :: vCorr(nv, nt, 2*nxc+1, 2*nyc+1)
real(kind=8), intent(out) :: vCorr(2*nxc+1, 2*nyc+1, nt, nv)
real(kind=8), dimension(:,:,:,:), allocatable :: rolled, prerolled
real(kind=8), dimension(:,:), allocatable :: Mean
real(kind=8), dimension(:,:,:), allocatable :: Mean3d
dims = shape(Vars)
nx=dims(1)
ny=dims(1)
dummysize=nx*ny
allocate(rolled(nx, ny, nt, nv))
allocate(prerolled(nx, ny, nt, nv))
allocate(Mean3d(ny, nt, nv))
allocate(Mean(nt, nv))
Mean3d = sum(Vars, dim=1)/size(Vars, dim=1)
Mean = sum(Mean3d, dim=1)/size(Mean3d, dim=1)
ii=1
do i=-nxc,+nxc
xdel(ii)=i
ii=ii+1
enddo
ii=1
do i=-nyc,+nyc
ydel(ii)=i
ii=ii+1
enddo
do iiy=1,size(ydel)
print*,'fortran loop:',iiy,' of', size(ydel)
prerolled = cshift(Vars, ydel(iiy), dim=2)
do iix=1,size(xdel)
rolled = cshift(prerolled, xdel(iix), dim=1)
forall (iv=1:nv)
forall (it=1:nt)
vCorr(iix,iiy,it,iv) = (sum(Vars(:,:,it,iv) * rolled(:,:,it,iv))/dummysize) / (Mean(it,iv)**2)
endforall
endforall
enddo
enddo
END SUBROUTINE
Also, even though there's a dp
parameter in the code now (which returns 8, as it should), if I declare variables with real(dp)
f2py throws me this error: Parameter 'dp' at (1) has not been declared or is a variable
, even though it is declared. So that's why I'm using 8
directly.