I am studying how to interface Fortran code with Python. I know of f2py, but since I did not succeed in using it with external libraries (such as LAPACK), I reverted to ctypes. I have everything working, but I noticed this behavior: for the same algorithm (a simple element-wise multiplication of two matrices), the Fortran code is slower than NumPy, while I expected it to be faster. I tried on two different PCs, one with Windows 10 and the other with Windows 11. I used gfortran from MinGW-w64 installed through winlibs.com (version gcc-12.2.0-llvm-15.0.7-mingw-w64ucrt-10.0.0-r4); my Python version is 3.9.13.
The measured times with timeit are (for 1000x1000 matrices):
With Numpy: 2.77 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
With Fortran: 12.4 ms ± 314 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The Fortran code is in a file "libfortran3.f90"
subroutine prodotto(a, b, n) bind(c, name='prodotto')
    ! declarations
    use iso_c_binding, only: c_double, c_int
    integer(c_int), intent(in) :: n
    real(c_double), dimension(n,n), intent(in) :: b
    real(c_double), dimension(n,n), intent(inout) :: a
    integer :: i, j
    do i = 1,n
        do j = 1,n
            a(i,j) = a(i,j)*b(i,j)
        end do
    end do
end subroutine prodotto
and it is compiled with gfortran -O3 -funroll-loops -ffast-math -fPIC -shared -o libfortran3.so libfortran3.f90
Please note that I used the same matrix for input and output, so I do not have to pass another matrix for the result (since I expect that doing so would make the problem worse).
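(As a side note on the comparison itself: A*B in NumPy allocates a new array, while the Fortran routine overwrites A in place. A strictly like-for-like NumPy baseline would use the in-place form sketched below; this is just an illustration, not the cell I timed.)

import numpy as np

N = 1000
A = np.random.rand(N, N)
B = np.random.rand(N, N)
# overwrite A with the element-wise product, like the intent(inout) Fortran version
np.multiply(A, B, out=A)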
The Python code (run in a notebook) is
from ctypes import CDLL, POINTER, c_int, c_double
import numpy as np
import time
import sys
fortran = CDLL('./libfortran3.so')
fortran.prodotto.argtypes = [ POINTER(c_double),
POINTER(c_double),
POINTER(c_int)]
fortran.prodotto.restype = None
#
N = 1000
A = np.random.rand(N,N)
B = np.random.rand(N,N)
#
print('With Numpy')
%timeit A*B
#
print('With Fortran')
A = np.asfortranarray(A, dtype=c_double)
B = np.asfortranarray(B, dtype=c_double)
Act = A.ctypes.data_as(POINTER(c_double))
Bct = B.ctypes.data_as(POINTER(c_double))
%timeit fortran.prodotto( Act, Bct, c_int(N) )
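As an aside, the argument types can also be declared with numpy.ctypeslib.ndpointer, which lets ctypes check the dtype, rank and memory order of the arrays at call time and avoids the explicit data_as casts. A minimal sketch of that variant (same prodotto symbol as above; not what I actually timed):

from ctypes import CDLL, POINTER, c_int
import numpy as np
from numpy.ctypeslib import ndpointer

fortran = CDLL('./libfortran3.so')
# a 2D, Fortran-ordered, double-precision array type that ctypes can verify
mat2d = ndpointer(dtype=np.float64, ndim=2, flags='F_CONTIGUOUS')
fortran.prodotto.argtypes = [mat2d, mat2d, POINTER(c_int)]
fortran.prodotto.restype = None

N = 1000
A = np.asfortranarray(np.random.rand(N, N))
B = np.asfortranarray(np.random.rand(N, N))
fortran.prodotto(A, B, c_int(N))   # arrays are passed directly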
Of course my Fortran code is not as optimized as NumPy's, but I would like to know if there is something I could modify in order to achieve better performance.
Edit: April 3rd, 2023
Thanks to mkrieger1 and janneb: indeed I made a poor choice in the nested loops! Swapping i and j in "libfortran3.f90", so that the inner loop runs over the first index (Fortran arrays are column-major), gives me this timing (with 1000x1000 matrices):
With Numpy: 2.67 ms ± 148 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
With Fortran: 1.01 ms ± 34.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I also tried with f2py, with the following code in file "libfortran1.f90"
subroutine prodotto(a, b, c, n)
    ! declarations
    integer, intent(in) :: n
    real(kind=8), dimension(n,n), intent(in) :: a, b
    real(kind=8), dimension(n,n), intent(out) :: c
    integer :: i, j
    do j = 1,n
        do i = 1,n
            c(i,j) = a(i,j)*b(i,j)
        end do
    end do
end subroutine prodotto
compiled with python -m numpy.f2py --compiler=mingw32 --fcompiler=gfortran -c libfortran1.f90 -m libfortran1
Running it in the notebook gives me:
With f2py: 12.3 ms ± 260 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
which is way slower than pure numpy.
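(For reference, the f2py wrapper is invoked roughly as below: f2py turns the intent(out) argument into the return value and infers n from the array shapes, so this is a sketch of the call rather than the exact cell I timed.)

import numpy as np
import libfortran1

N = 1000
A = np.asfortranarray(np.random.rand(N, N))
B = np.asfortranarray(np.random.rand(N, N))
# c is returned by the wrapper; n is optional and taken from shape(a)
C = libfortran1.prodotto(A, B)
%timeit libfortran1.prodotto(A, B)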
Just for completeness, I also tried this Fortran code, which uses LAPACK (which I built on my system), in "libfortran0.f90"
subroutine risolvi(A, b, x, n) bind(c, name='risolvi')
    ! declarations
    use iso_c_binding, only: c_double, c_int
    integer(c_int), intent(in) :: n
    real(c_double), dimension(n,n), intent(inout) :: A
    real(c_double), dimension(n), intent(inout) :: b
    real(c_double), dimension(n), intent(out) :: x
    integer(c_int), dimension(n) :: pivot
    integer(c_int) :: ok
    ! call LAPACK subroutine
    call DGESV(n, 1, A, n, pivot, b, n, ok)
    x = b
end subroutine risolvi
compiled with gfortran -O3 -funroll-loops -ffast-math -fPIC -shared -L. -lblas -llapack -o libfortran0.so libfortran0.f90
and I compare it with NumPy:
from ctypes import CDLL, POINTER, c_int, c_double
import numpy as np
import time
import sys
import libfortran1
fortran = CDLL('./libfortran0.so')
fortran.risolvi.argtypes = [ POINTER(c_double),
POINTER(c_double),
POINTER(c_double),
POINTER(c_int)]
fortran.risolvi.restype = None
#
N = 1000
A = np.random.rand(N,N)
b = np.random.rand(N)
#
print('With Numpy')
%timeit np.linalg.solve(A,b)
#
print('With Fortran')
A = np.asfortranarray(A, dtype=c_double)
b = np.asfortranarray(b, dtype=c_double)
x = np.asfortranarray(np.zeros(N), dtype=c_double)
Act = A.ctypes.data_as(POINTER(c_double))
bct = b.ctypes.data_as(POINTER(c_double))
xct = x.ctypes.data_as(POINTER(c_double))
fortran.risolvi( Act, bct, xct, c_int(N) )
%timeit fortran.risolvi( Act, bct, xct, c_int(N) )
I obtain the following timings (with N=1000):
With Numpy: 13.6 ms ± 596 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
With Fortran: 221 ms ± 9.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Anyway, with N=30, using Fortran is faster:
With Numpy: 15.6 µs ± 244 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
With Fortran: 11.4 µs ± 237 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Is it possible that there is something wrong in my code when passing large arrays? Or is it just that, as I read, "ctypes is made for compatibility, not speed"?