How can I improve performance of this Fortran code interfacing with Python?

Question

I am studying the possibility to interface Fortran code to be used in Python. I know of f2py but, since I did not succeed in using it with external libraries (such as lapack), I reverted to use ctypes. I have everything working, but I noticed this behavior: for the same algorithm (a simple multiplication of two matrices element-wise) the Fortran code is slower than numpy, while I expected it to be faster. I tried with two different PC, one with Windows 10 and the other with Windows 11. I used gfortran from MinGW-64 installed through winlibs.com version gcc-12.2.0-llvm-15.0.7-mingw-w64ucrt-10.0.0-r4, my Python version is 3.9.13

The measured times with timeit are (for matrices 1000x1000)

With Numpy: 2.77 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

With Fortran: 12.4 ms ± 314 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The Fortran code is in a file "libfortran3.f90"

subroutine prodotto(a, b, n) bind(c, name='prodotto')
  ! declarations
  use iso_c_binding, only: c_double, c_int

  integer(c_int), intent(in) :: n
  real(c_double), dimension(n,n), intent(in) :: b
  real(c_double), dimension(n,n), intent(inout) :: a
  
  do i = 1,n
    do j = 1,n
        a(i,j) = a(i,j)*b(i,j)
    end do
  end do
  
end subroutine prodotto

and it is compiled with gfortran -O3 -funroll-loops -ffast-math -fPIC -shared -o libfortran3.so libfortran3.f90. Please note that I used the same matrix for input and output so I do not have to pass another matrix for the result (since I expect that this worsen the problem).

The Python code (run in a notebook) is

from ctypes import CDLL, POINTER, c_int, c_double
import numpy as np
import time
import sys

fortran = CDLL('./libfortran3.so')
fortran.prodotto.argtypes = [ POINTER(c_double), 
                                 POINTER(c_double), 
                                 POINTER(c_int)]
fortran.prodotto.restype = None
#
N = 10
A = np.random.rand(N,N)
B = np.random.rand(N,N)
#
print('With Numpy')
%timeit A*B
#
print('With Fortran')
A = np.asfortranarray(A, dtype=c_double)
B = np.asfortranarray(B, dtype=c_double)
Act = A.ctypes.data_as(POINTER(c_double))
Bct = B.ctypes.data_as(POINTER(c_double))
%timeit fortran.prodotto( Act, Bct, c_int(N) )

Of course my Fortran code is not as optimized as Numpy, but I would like to know if there is something which I could modify in order to achieve better performances.

Edit April, 3rd, 2023

Thanks to mkrieger1 and janneb, indeed I made a poor decision in the nested loop! Switching i with j in "libfortran3.f90" gives me this timing (with matrices 1000x1000):

With Numpy: 2.67 ms ± 148 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

With Fortran: 1.01 ms ± 34.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

I also tried with f2py, with the following code in file "libfortran1.f90"

subroutine prodotto(a, b, c, n) 
  ! declarations

  integer, intent(in) :: n
  real(kind=8), dimension(n,n), intent(in) :: a, b
  real(kind=8), dimension(n,n), intent(out) :: c
  
  do j = 1,n
    do i = 1,n
        c(i,j) = a(i,j)*b(i,j)
    end do
  end do
  
end subroutine prodotto

compiled with python -m numpy.f2py --compiler=mingw32 --fcompiler=gfortran -c libfortran1.f90 -m libfortran1, running it in the notebook gives me:

With f2py: 12.3 ms ± 260 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

which is way slower than pure numpy.

Just for completeness, if I try to compile this fortran code which uses Lapack (which I built on my system) in "libfortran0.f90"

subroutine risolvi(A,b,x,n) bind(c, name='risolvi')
  ! declarations
  use iso_c_binding, only: c_double, c_int

  integer(c_int), intent(in) :: n
  real(c_double), dimension(n,n), intent(inout) :: A
  real(c_double), dimension(n), intent(inout) :: b
  real(c_double), dimension(n), intent(out) :: x
  integer(c_int), dimension(n) :: pivot(n)
  integer(c_int) :: ok
  ! call LAPACK subroutine
  call DGESV(n, 1, A, n, pivot, b, n, ok) 
  x = b
  
end subroutine risolvi

compiled with gfortran -O3 -funroll-loops -ffast-math -fPIC -shared -L. -lblas -llapack -o libfortran0.so libfortran0.f90, and I compare with numpy

from ctypes import CDLL, POINTER, c_int, c_double
import numpy as np
import time
import sys
import libfortran1

fortran = CDLL('./libfortran0.so')
fortran.risolvi.argtypes = [ POINTER(c_double), 
                                 POINTER(c_double), 
                                 POINTER(c_double), 
                                 POINTER(c_int)]
fortran.risolvi.restype = None
#
N = 1000
A = np.random.rand(N,N)
b = np.random.rand(N)
#
print('With Numpy')
%timeit np.linalg.solve(A,b)
#
print('With Fortran')
A = np.asfortranarray(A, dtype=c_double)
b = np.asfortranarray(b, dtype=c_double)
x = np.asfortranarray(np.zeros(N), dtype=c_double)
Act = A.ctypes.data_as(POINTER(c_double))
bct = b.ctypes.data_as(POINTER(c_double))
xct = x.ctypes.data_as(POINTER(c_double))
fortran.risolvi( Act, bct, xct, c_int(N) )
%timeit fortran.risolvi( Act, bct, xct, c_int(N) )

I obtain the following timimgs (with N=1000)

With Numpy: 13.6 ms ± 596 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

With Fortran: 221 ms ± 9.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Anyway, with N=30 the use of Fortran is faster

With Numpy: 15.6 µs ± 244 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

With Fortran: 11.4 µs ± 237 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Is it possible that there is something wrong in my code when passing large arrays? Or is just that, as I read, "ctypes is made for compatibility, not speed"?

As @mkrieger1 suggests your loops are the wrong way around. Also what flags did you use to compile the Fortran? Also is the lapack you are using threaded? If so how many threads are you using? — Ian Bush, Mar 31 '23 at 19:00
@TimRoberts The element-wise A*B is a simple operation, no LAPACK or even BLAS. LAPACK is for linear systems, eigenvalues and similar. This one is https://stackoverflow.com/questions/7621520/element-wise-vector-vector-multiplication-in-blas — Vladimir F Героям слава, Mar 31 '23 at 20:40

score 1 · Accepted Answer · answered Mar 31 '23 at 20:13

As mentioned in a comment, you're looping the wrong way in your Fortran code, as Fortran is column-major. That being said, in this case since it's an element by element multiplication, you can just replace the nested do loop with a simple

a = a * b

Secondly, N=10 is quite short, and the extra overhead to call an external function obscures the performance of the actual code.

Setting N=100 and the replacement of the manual loop with the array multiplication, I get

With Numpy
5.00679704999493
With Fortran
3.699547360010911

How can I improve performance of this Fortran code interfacing with Python?

1 Answers1

Linked