
I have Fortran code that I'm compiling with f2py to run from Python. I wrote it to be a faster approach to an existing Python code, but it's actually running slower than its Python counterpart, which makes me think that it isn't optimized. (This is probably related to this question, although the culprit in that case was something that I'm not using here, so it doesn't apply to me.)

The code takes a 4D matrix as input and performs a 2D correlation along dimensions 3 and 4 (as if they were x and y).
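
(In symbols, reading from the code below, the quantity computed appears to be a normalized circular autocorrelation:

    vCorr(iv, it, iix, iiy) = mean over (x, y) of [ Vars(iv,it,x,y) * Vars(iv,it,x+dx,y+dy) ] / Mean(iv,it)**2

where dx = xdel(iix) runs over [-nxc, nxc], dy = ydel(iiy) runs over [-nyc, nyc], and x+dx, y+dy wrap around periodically.)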

The code is this one:

SUBROUTINE correlate4D(Vars, nxc, nyc, nv, nt, vCorr)
! Correlates 4D array assuming that x, y are dims 3 and 4
! Returns a 4D array with shape nv, nt, 2*nxc+1, 2*nyc+1
IMPLICIT NONE
real(kind=8), intent(in), dimension(:,:,:,:) :: Vars
real(kind=8) :: dummysize
integer, intent(in) :: nxc, nyc
integer :: ii, i, iix, iiy, iv, it, dims(4), nv, nt, nx, ny
integer, dimension(2*nxc+1) :: xdel
integer, dimension(2*nyc+1) :: ydel
real(kind=8), intent(out) :: vCorr(nv, nt, 2*nxc+1, 2*nyc+1)
real(kind=8), dimension(:,:,:,:), allocatable :: rolled, prerolled
real(kind=8), dimension(:,:), allocatable :: Mean
real(kind=8), dimension(:,:,:), allocatable :: Mean3d

dims = shape(Vars)
nx=dims(3)
ny=dims(4)
dummysize=nx*ny
allocate(rolled(nv, nt, nx, ny))
allocate(prerolled(nv, nt, nx, ny))
allocate(Mean3d(nv, nt, nx))
allocate(Mean(nv, nt))

Mean3d = sum(Vars, dim=4)/size(Vars, dim=4)
Mean = sum(Mean3d, dim=3)/size(Mean3d, dim=3)

! These replace np.arange()
ii=1
do i=-nxc,+nxc
    xdel(ii)=i
    ii=ii+1
enddo
ii=1
do i=-nyc,+nyc
    ydel(ii)=i
    ii=ii+1
enddo

! Calculate the correlation
do iiy=1,size(ydel)
    print*,'fortran loop:',iiy,' of', size(ydel)
    ! cshift replaces np.roll()
    prerolled = cshift(Vars, ydel(iiy), dim=4)
    do iix=1,size(xdel)
        rolled = cshift(prerolled, xdel(iix), dim=3)
        forall (it=1:nt)
            forall (iv=1:nv)
                vCorr(iv,it,iix,iiy) = (sum(Vars(iv,it,:,:) * rolled(iv,it,:,:))/dummysize) / (Mean(iv,it)**2)
            endforall
        endforall
    enddo
enddo

END SUBROUTINE

Running this with a matrix of size (3, 50, 100, 100) takes 251 seconds with this code compiled with f2py, while the pure Python/numpy code takes only 103 seconds. By the way, this is somewhat smaller than the average matrix this takes as input, which should be something like (3, 300, 100, 100), but not much larger than that.

Can anyone point out ways in which this code can be optimized?

EDIT

I'm compiling with f2py3.4 -c mwe.f90 -m mwe, and then it can be called with

In [1]: import mwe
In [2]: import numpy as np
In [3]: a=np.random.randn(3,15,100,100)
In [4]: mwe.correlate4d(a, 50, 50, 3, 15)

EDIT2

After reading the comments, I was able to improve it by changing the order of the indices. Now it is about 10% faster than Python, but it's still too slow. I'm sure this can be done faster.

SUBROUTINE correlate4D2(Vars, nxc, nyc, nt, nv, vCorr)
! Correlates 4D array assuming that x, y are dims 1 and 2
! Returns a 4D array with shape 2*nxc+1, 2*nyc+1, nt, nv
IMPLICIT NONE
INTEGER, PARAMETER  ::  dp = SELECTED_REAL_KIND (13)
real(kind=8), intent(in), dimension(:,:,:,:) :: Vars
real(kind=8) :: dummysize
integer, intent(in) :: nxc, nyc
integer :: ii, i, iix, iiy, iv, it, dims(4), nv, nt, nx, ny
integer, dimension(2*nxc+1) :: xdel
integer, dimension(2*nyc+1) :: ydel
!real(kind=8), intent(out) :: vCorr(nv, nt, 2*nxc+1, 2*nyc+1)
real(kind=8), intent(out) :: vCorr(2*nxc+1, 2*nyc+1, nt, nv)
real(kind=8), dimension(:,:,:,:), allocatable :: rolled, prerolled
real(kind=8), dimension(:,:), allocatable :: Mean
real(kind=8), dimension(:,:,:), allocatable :: Mean3d

dims = shape(Vars)
nx=dims(1)
ny=dims(2)
dummysize=nx*ny
allocate(rolled(nx, ny, nt, nv))
allocate(prerolled(nx, ny, nt, nv))
allocate(Mean3d(ny, nt, nv))
allocate(Mean(nt, nv))

Mean3d = sum(Vars, dim=1)/size(Vars, dim=1)
Mean = sum(Mean3d, dim=1)/size(Mean3d, dim=1)

ii=1
do i=-nxc,+nxc
    xdel(ii)=i
    ii=ii+1
enddo
ii=1
do i=-nyc,+nyc
    ydel(ii)=i
    ii=ii+1
enddo

do iiy=1,size(ydel)
    print*,'fortran loop:',iiy,' of', size(ydel)
    prerolled = cshift(Vars, ydel(iiy), dim=2)
    do iix=1,size(xdel)
        rolled = cshift(prerolled, xdel(iix), dim=1)
        forall (iv=1:nv)
            forall (it=1:nt)
                vCorr(iix,iiy,it,iv) = (sum(Vars(:,:,it,iv) * rolled(:,:,it,iv))/dummysize) / (Mean(it,iv)**2)
            endforall
        endforall
    enddo
enddo

END SUBROUTINE

Also, even though there's a dp parameter in the code now (which returns 8, as it should), if I declare variables with real(dp), f2py throws this error: Parameter 'dp' at (1) has not been declared or is a variable, even though it is declared. So that's why I'm using 8 directly.
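
(A sketch of the module-based workaround suggested in the comments below, assuming f2py accepts a kind parameter that is use-associated from a module; untested here:

module kinds
    implicit none
    integer, parameter :: dp = selected_real_kind(13)   ! double precision
end module kinds

SUBROUTINE correlate4D2(Vars, nxc, nyc, nt, nv, vCorr)
use kinds, only: dp     ! dp now comes from the module, not a local parameter
IMPLICIT NONE
real(dp), intent(in), dimension(:,:,:,:) :: Vars
! ... remaining declarations and body as above, with real(dp) replacing real(kind=8)
END SUBROUTINE

The module file would then be passed to f2py together with the subroutine.)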

TomCho
  • I'm not sure there's a solid reason to expect your Fortran code to be faster than the heavily-optimized C code that numpy is written in. – A_K Apr 14 '17 at 20:20
  • @user2752364 But the loops are still the same in Python, and Python loops are supposed to be really slow. The only things that numpy does in the code above are the means and the `cshift` function. The rest is the same. – TomCho Apr 14 '17 at 20:23
  • Well, a quick glance tells me the memory access pattern in the inner loops is HORRIBLE for Fortran, while, if I understand things correctly, it is what you want for Python. It may be as simple as that. How are you compiling this? Can you provide a simple driver program? And real(kind=8) is non-portable and bad practice. – Ian Bush Apr 14 '17 at 20:50
  • Those `kind=8` hurt my eyes. Where do people get it from? – Vladimir F Героям слава Apr 14 '17 at 20:50
  • Yes, the `iix` and `iiy` loops are in a bad order, but they come after, so I am not sure how big an effect it has. Also, forall is not a good thing for performance. It was supposed to be, but it is not in reality. – Vladimir F Героям слава Apr 14 '17 at 21:25
  • Is the innermost line (vCorr = sum*rolled/mean) handled by numpy as well? If so, pretty much every nontrivial computation is already happening at C level in Python and so I'm not surprised it's pretty fast. – DSM Apr 14 '17 at 21:40
  • @VladimirF I'm talking about the loop implied by the sum. – Ian Bush Apr 14 '17 at 21:41
  • @IanBush Sorry, what is the best way then? I'm not used to programming in Fortran. I did it that way because I read somewhere that that's how Fortran stores the matrices. – TomCho Apr 14 '17 at 22:27
  • @DSM That's a `np.mean(Vars*rolled)` call in Python. But I'm not sure how optimized this is supposed to be. – TomCho Apr 14 '17 at 22:28
  • @IanBush I'll provide the f2py command to compile it. About the `kind=8`, I'm only using it because f2py gives me an error if I use `real(dp)` – TomCho Apr 14 '17 at 22:35
  • Who knows what *"an error"* might be... if dp = 8 the behaviour should be exactly the same. – Vladimir F Героям слава Apr 14 '17 at 23:23
  • Yes, the sums are very non-contiguous. You'd better switch the indexes so that you can do `sum(Vars(:,:,iv,it))`. – Vladimir F Героям слава Apr 14 '17 at 23:28
  • @VladimirF I included the error with my edit, which also has an updated version of the code, which runs faster but not as fast as it should (I think). – TomCho Apr 14 '17 at 23:57
  • Does changing forall(...) to the usual do-loop make any improvement? (but I guess the slow part may be cshift() as compared to np.roll() etc, though not sure at all...). Rewriting the entire loops without cshift() (e.g., using shift indices) and avoiding temporaries may give speedup (but not sure atm...) – roygvib Apr 15 '17 at 01:54
  • @roygvib Having it as a do-loop slows it down a bit, but makes little difference (at least for the array sizes that I tested with). And yeah, I think you're right about `cshift`. But I don't really know how to optimize that. – TomCho Apr 15 '17 at 01:57
  • The dot product should be SIMD-optimized – tim18 Apr 15 '17 at 05:13
  • Gfortran would need e.g. -march=native -ffast-math -O3 – tim18 Apr 15 '17 at 05:16
  • You might replace the 2 divisions with a multiply and 1 division – tim18 Apr 15 '17 at 05:21
  • OK, looks like f2py3.4 is badly broken, as indeed it doesn't accept real(kind=dp) in the form you report above. This should be reported. However, if you stick the specification of dp in a module and use the module, it works. As this is probably the best way anyway, this is what I suggest you do. – Ian Bush Apr 15 '17 at 08:09
  • BTW if I understand what you are doing correctly and you are worried about speed why aren't you using Fourier transform based methods? – Ian Bush Apr 15 '17 at 08:10
  • At least on my laptop it looks as though replacing the foralls with do loops can give an appreciable speed-up, which makes me happy as I recommend people avoid forall. However, the timings on my laptop are not very repeatable; when I get near a proper computer with all the software on it I'll try to remember to take another look. – Ian Bush Apr 15 '17 at 08:32

1 Answer


Note: A rather long and boring answer follows...

Because it seems expensive to use cshift() repeatedly for large matrices, I have tried some modifications around cshift. To do so, I first created a minimal version of the OP's code:

program main
    implicit none
    integer, parameter :: N = 100, nt = 50, dp = kind(0.0d0)
    real(dp), allocatable, dimension(:,:,:) :: A, Ashift_y, Ashift, B
    integer :: sx, sy, i, t

    allocate( A( N, N, nt ), Ashift_y( N, N, nt ), Ashift( N, N, nt ), &
              B( -N:N-1, -N:N-1, nt ) )
    call initA

    do sy = -N, N-1
        if ( mod( sy, N/10 ) == 0 ) print *, "sy = ", sy

        Ashift_y = cshift( A, sy, dim=2 )

        do sx = -N, N-1
            Ashift = cshift( Ashift_y, sx, dim=1 )

            do t = 1, nt
                B( sx, sy, t )= sum( A( :, :, t ) * Ashift( :, :, t ) )
            enddo
        enddo
    enddo

    print *, "sum(B) = ", sum(B)
    print *, "sum( B( 0:N-1, 0:N-1, : ) ) = ", sum( B( 0:N-1, 0:N-1, : ) )

contains
    subroutine initA
        integer ix, iy
        forall( t = 1:nt, iy = 1:N, ix = 1:N )  &   ! (forall not recommended but for brevity)
                A( ix, iy, t ) = 1.0_dp / ( mod(ix + iy + t, 100) + 1 )
    endsubroutine    
endprogram

which gives

sum(B) =    53817771.021093562     
sum( B( 0:N-1, 0:N-1, : ) ) =    13454442.755258575 

Mac mini (2.3GHz,4-core), gfortran-6.3 -O3 : 50 sec
Linux (2.6GHz,16-core),   gfortran-4.8 -O3 : 32 sec

Next, because cshift(A,s,dim=1 (or 2)) is periodic with respect to the shift s (with periodicity N), the loop over sx and sy can be split into four parts and only the first quadrant retained (i.e., sx and sy in [0,N-1]). The data of the other quadrants are obtained by simply copying the data of the first quadrant. This reduces the CPU time by a factor of 4. (More simply, we could calculate sx and sy only in [-N/2,N/2] because B for other regions gives no new information.)

    do sy = 0, N-1
        if ( mod( sy, N/10 ) == 0 ) print *, "sy = ", sy
        Ashift_y = cshift( A, sy, dim=2 )

        do sx = 0, N-1
            Ashift = cshift( Ashift_y, sx, dim=1 )

            do t = 1, nt
                B( sx, sy, t )= sum( A( :, :, t ) * Ashift( :, :, t ) )
            enddo
        enddo
    enddo

    print *, "sum( B( 0:N-1, 0:N-1, : ) ) = ", sum( B( 0:N-1, 0:N-1, : ) )

    !! Make "full" B.
    B( -N :  -1,  0 : N-1, : ) = B( 0 : N-1, 0 : N-1, : )
    B(  0 : N-1, -N :  -1, : ) = B( 0 : N-1, 0 : N-1, : )
    B( -N :  -1, -N :  -1, : ) = B( 0 : N-1, 0 : N-1, : )
    print *, "sum(B) = ", sum(B)

The result agrees with the full calculation, as expected:

sum(B) =    53817771.021093562     
sum( B( 0:N-1, 0:N-1, : ) ) =    13454442.755258575     

Mac   : 12.8 sec
Linux :  8.3 sec

The corresponding Python code may look like this:

from __future__ import print_function, division
import numpy as np

N, nt = 100, 50

A = np.zeros( (nt, N, N) )
B = np.zeros( (nt, N, N) )

for t in range(nt):
    for iy in range(N):
        for ix in range(N):
            A[ t, iy, ix ] = 1.0 / ( (ix + iy + t) % 100 + 1 )

for sy in range( N ):
    if sy % (N // 10) == 0 : print( "sy = ", sy )
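    # note: np.roll(A, -s, axis=...) matches Fortran's cshift(A, s, dim=...), whose shift sign is opposite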
    Ashift_y = np.roll( A, -sy, axis=1 )

    for sx in range( N ):
        Ashift = np.roll( Ashift_y, -sx, axis=2 )

        for t in range( nt ):
            B[ t, sy, sx ] = np.sum( A[ t, :, : ] * Ashift[ t, :, : ] )

print( "sum( B( :, 0:N-1, 0:N-1 ) ) = ",  np.sum( B ) )

which runs in 22--24 sec on both Mac and Linux (python3.5).

To further reduce the cost, we utilize the fact that cshift may be used in two equivalent ways:

cshift( array, s ) == array( cshift( [1,2,...,n], s ) )   !! assuming that "array" is declared as array( n )

Then we can rewrite the above code such that cshift() receives only ind = [1,2,...,N]:

    integer, dimension(N) :: ind, indx, indy
    ind = [( i, i=1,N )]

    do sy = 0, N-1
        if ( mod( sy, N/10 ) == 0 ) print *, "sy = ", sy
        indy = cshift( ind, sy )

        do sx = 0, N-1
            indx = cshift( ind, sx )

            do t = 1, nt
                B( sx, sy, t )= sum( A( :, :, t ) * A( indx, indy, t ) )
            enddo
        enddo
    enddo

which runs in ~5 sec on both Mac and Linux. Similar methods may be applicable to Python as well. (I also tried using mod() explicitly for indices to eliminate cshift completely, but, a bit surprisingly, it was more than two times slower than the above code...)
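
(For reference, a sketch of the mod()-based indexing mentioned in the parenthesis above, which removes cshift entirely; for 0 <= sx < N and 1-based indices it should be equivalent to indx = cshift(ind, sx):

    do i = 1, N
        indx(i) = mod( i - 1 + sx, N ) + 1   ! element i of cshift(ind, sx)
    enddo
)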

Even with this reduction, the code becomes slow for large nt (say 300, as shown in the question). In that case, we may resort to the final weapon: parallelizing the loop over sy with OpenMP:

program main
    implicit none
    integer, parameter :: N = 100, nt = 50, dp = kind(0.0d0)
!    integer, parameter :: N = 100, nt = 300, dp = kind(0.0d0)
    real(dp), allocatable, dimension(:,:,:) :: A, B
    integer, dimension(N) :: ind, indx, indy
    integer :: sx, sy, i, t

    allocate( A( N, N, nt ), B( -N:N-1, -N:N-1, nt ) )
    call initA
    ind = [( i, i=1,N )]

    !$omp parallel do private(sx,sy,t,indx,indy)
    do sy = 0, N-1
        if ( mod( sy, N/10 ) == 0 ) print *, "sy = ", sy
        indy = cshift( ind, sy )

        do sx = 0, N-1
            indx = cshift( ind, sx )

            do t = 1, nt
                B( sx, sy, t )= sum( A( :, :, t ) * A( indx, indy, t ) )
            enddo
        enddo
    enddo
    !$omp end parallel do

    print *, "sum( B( 0:N-1, 0:N-1, : ) ) = ", sum( B( 0:N-1, 0:N-1, : ) )

    ! "contains subroutine initA ..." here

endprogram

The timing data are like this (with gfortran -O3 -fopenmp):

N = 100, nt = 50
sum( B( 0:N-1, 0:N-1, : ) ) =    13454442.755258575     
Mac:
   serial : 5.3 sec
2 threads : 2.6 sec
4 threads : 1.4 sec

N = 100, nt = 50
sum( B( 0:N-1, 0:N-1, : ) ) =    13454442.755258575     
Linux:
    serial : 4.8 sec
 2 threads : 2.7 sec
 4 threads : 1.3 sec
 8 threads : 0.7 sec
16 threads : 0.4 sec
32 threads : 0.4 sec

N = 100, nt = 300   // heavy case
sum( B( 0:N-1, 0:N-1, : ) ) =    80726656.531429410     
Linux:
 2 threads: 16.5 sec
 4 threads:  8.4 sec
 8 threads:  4.4 sec
16 threads:  2.5 sec

So, if the above code does not have a bug (hopefully!), we can save much CPU time by (1) limiting sx and sy to [0,N-1] (or, more simply, [-N/2,N/2] without further copying), (2) applying cshift to an index array (rather than to data arrays), and/or (3) parallelizing over sy (which may hopefully be combined with f2py...).
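
(For the f2py route, the OpenMP flag is usually passed through to gfortran along these lines; the exact invocation is an assumption, untested here:

f2py3.4 -c mwe.f90 -m mwe --f90flags="-fopenmp" -lgomp
)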

roygvib
  • As a non-Fortran programmer I gotta say this looks amazing. I don't know if I'm going to be able to apply everything, though, since I'm compiling with f2py, which has a bunch of restrictions. – TomCho Apr 16 '17 at 01:33
  • @TomCho Even in pure Python, there seems to be room for reducing CPU time (say, simply by limiting iix and iiy to [-nxc/2, nxc/2] and [-nyc/2, nyc/2], because other values of iix and iiy are redundant and give no new information). As for parallelism, f2py + OpenMP might be tricky... (no experience, so not sure at all, but it might be doable). Other speedup options may be 1) using fast Python alternatives like Cython (but it takes time to check the syntax), 2) some parallel facilities in Python, and 3) using FFT etc. if possible (as suggested in the comments). – roygvib Apr 16 '17 at 03:39
  • nyc and nxc are already supposed to be Nxy/2, even in my original code. But I guess that wasn't explicit. FFT might be possible, but since this isn't exactly a correlation, I still wasn't able to do it with FFTs. – TomCho Apr 16 '17 at 04:33
  • I see... then the gain from the above answer is rather limited. I also thought about FFT, but it seems unclear how to do it (atm). For this, it might be useful to post a question also at https://scicomp.stackexchange.com/ (with more focus on searching for more efficient algorithms). – roygvib Apr 16 '17 at 05:34