
I have a question concerning numpy/Python and Fortran running speed. To start with, I reimplemented a working Python program in Fortran. It works fine, but I realized that the Fortran program loses more of its speed with growing array sizes than the numpy version does.

Here are some numbers. For a large step size, Fortran (with the Intel Fortran compiler) takes 0.2 s and Python takes 5 s. When I first saw this, I was delighted. But then I reduced the step size, and the Fortran program took 770 s while the Python one took 1450 s. That's a loss of almost a factor of 10 in the speed advantage, and I suppose that if I reduce the step size further, Python will be faster again. That sucks.

And I timed almost all the steps. The Fortran loops over arrays get 10 times slower with a ten times smaller step size, which is what you would expect. But the numpy array operations get only 2-3 times slower.

Does anyone know what these numpy functions do so that their cost doesn't grow linearly? And is there anything comparable I can do in Fortran?

Here is a short example; the whole code has more than 1000 lines, so nobody would ever read it all. psi (called chi in the snippet below) is a complex array, and r is a real/double array whose length depends on dr. First, the Python code.

phi0= 4* pi * np.cumsum(np.cumsum(r * np.abs(chi)**2) * dr) * dr / r
phi0 += - phi0[-1] - N/r[-1]

With dr=0.1 it takes 0.00006 s, with dr=0.01 0.00008 s, and with dr=0.001 0.0002 s.
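To make the comparison concrete, here is a self-contained sketch (with made-up values for chi, r, dr and N, since the question's real data isn't shown) verifying that the numpy one-liner is exactly a double prefix sum, i.e. the same thing the Fortran loops below compute:

```python
import numpy as np

# Hypothetical stand-ins for the question's arrays.
n, dr, N = 200, 0.01, 1.0
r = dr * np.arange(1, n + 1)            # start at dr so r has no zero
chi = np.exp(-r) * (1 + 0.5j)           # arbitrary smooth complex function

pi = np.pi
phi0 = 4 * pi * np.cumsum(np.cumsum(r * np.abs(chi) ** 2) * dr) * dr / r
phi0 += -phi0[-1] - N / r[-1]

# The same quantity with two explicit prefix-sum loops, mirroring the Fortran code.
s1 = np.empty(n)
phi_loop = np.empty(n)
acc1 = 0.0
for i in range(n):
    acc1 += r[i] * abs(chi[i]) ** 2
    s1[i] = acc1 * dr
acc2 = 0.0
for i in range(n):
    acc2 += s1[i]
    phi_loop[i] = 4 * pi * acc2 * dr / r[i]
phi_loop += -phi_loop[-1] - N / r[-1]

assert np.allclose(phi0, phi_loop)
```

The two versions agree to floating-point tolerance, so any timing difference is purely about how the prefix sums are executed, not about what is computed.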

Here is the Fortran code:

integer :: i, j, m
double precision :: sum2_j, pi = 3.14159265359, N, dr, sum1_i
double precision, dimension(:), allocatable :: sum1_array, phi, step1, r
complex(8), dimension(:), allocatable :: psi
!double precision :: start, finish

m = size(psi)
allocate (phi(m))
allocate (sum1_array(m))
allocate (step1(m))

!call cpu_time(start)

sum1_i = 0
step1 = r*abs(psi)**2
do i = 1, size(psi)
    sum1_i = sum1_i + step1(i)
    sum1_array(i) = sum1_i*dr
end do

sum2_j = 0
do j = 1, size(phi)
    sum2_j = sum2_j + sum1_array(j)
    phi(j) = 4*pi*sum2_j*dr/r(j)
end do

phi = phi - phi(size(phi)) - N/r(size(r))

The run times (with Eclipse/Photran; with Intel Fortran it is about 2 times faster): dr=0.1: 0.0000008 s, dr=0.01: 0.00006 s, dr=0.001: 0.00045 s.

As you can see, Python is many times slower at a large step size but actually faster at a small one. This issue concerns the two loops in the Fortran code, but it is not specific to that code; it occurs in all loops. As I said, this is just an example. I'm willing to try anything, because I do not understand why this is happening.

Tejas Shetty
fortran1
    it's probably cached. your python code also makes a new array for every. single. operation. push it down if you're looking for speed. check out numba. https://numba.pydata.org/ – luthervespers Jan 21 '21 at 17:40
  • OK, but please explain a little more: what do you mean by 'pushing down'? Numba is for speeding up Python code, but I want my Fortran code to be faster. :-) – fortran1 Jan 21 '21 at 17:49
  • Be aware that your value of pi is entered as a single precision, not a double precision constant. https://stackoverflow.com/questions/6146005/is-there-a-better-double-precision-assignment-in-fortran-90 – Vladimir F Героям слава Jan 21 '21 at 17:52
  • Thanks for that, but it doesn't change anything. – fortran1 Jan 21 '21 at 18:09
  • I wouldn't trust your Fortran code when it has `m=size(psi)` with `psi` not allocated. – francescalus Jan 21 '21 at 18:09
  • A compilable MWE is almost always the fastest way to get concrete advice from the experts here. I know it can be quite cumbersome to strip your code down to a few lines, but it's worth the effort. – AboAmmar Jan 21 '21 at 21:25
  • psi is already allocated, but in a different calculation. – fortran1 Jan 21 '21 at 21:46
  • @fortran1 In that case please show a [mcve]. – Vladimir F Героям слава Jan 21 '21 at 22:06
  • And show it for both Fortran and Python. As is, nobody can confirm your timings, let alone explain them. One possibility is that numpy was compiled with parallelization (either or both of SIMD and OpenMP) and the Fortran code was not. What are the compiler and the compilation options? What is the version of numpy? – Jan 22 '21 at 07:44
  • Google says that numpy does not use parallelization. To rule out the allocation issue, I now open/read psi/r from a saved txt file, which I calculated in a separate program, so there should be no overlap or anything similar. The run times are the same again, so psi/r should be allocated correctly. – fortran1 Jan 22 '21 at 11:34
  • Then Google is wrong - but more probably, *you* are. The base numpy package has at least [simd](https://numpy.org/devdocs/reference/simd/simd-optimizations.html), and [numpy-mkl](https://pypi.org/project/numpy-mkl/) calls Intel MKL, which has OpenMP too. – Jan 22 '21 at 12:06
  • OK, but in htop I only see one active processor, while the others are more or less at 0% (Intel i7 4700). – fortran1 Jan 22 '21 at 14:15
  • I can give a small update: I fixed all the forgotten real parameters and gained a few percent, but really not much. – fortran1 Jan 22 '21 at 21:46
  • And I tested the program on a faster PC. At step size 0.005 (array length about 1e5), which is what I want to use for my calculations, Fortran (ifort) is 2.6 times faster (my laptop is only as fast as numpy). And I think my CPU is still the limit, so MAYBE the faster the CPU, the more Fortran gains. If someone has a much faster CPU than an i5 6600K, I could send the program and we could see whether this is correct. – fortran1 Jan 22 '21 at 21:56
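On the earlier comment that the Python code "makes a new array for every single operation": each operator in the numpy one-liner does allocate a fresh temporary. A hedged sketch (with synthetic psi and r, since the real data isn't given) of the same computation reusing a single work buffer through numpy's `out=` parameters and in-place operators:

```python
import numpy as np

# Hypothetical stand-ins for the question's arrays.
rng = np.random.default_rng(0)
n, dr, N = 1000, 0.01, 1.0
r = dr * np.arange(1, n + 1)
psi = rng.normal(size=n) + 1j * rng.normal(size=n)

# Original expression: every operation allocates a fresh temporary array.
phi_ref = 4 * np.pi * np.cumsum(np.cumsum(r * np.abs(psi) ** 2) * dr) * dr / r
phi_ref += -phi_ref[-1] - N / r[-1]

# Same computation, reusing one work buffer instead of chaining temporaries.
work = np.empty(n)
np.multiply(r, np.abs(psi) ** 2, out=work)  # work = r * |psi|^2
np.cumsum(work, out=work)                   # first prefix sum, in place
work *= dr
np.cumsum(work, out=work)                   # second prefix sum, in place
work *= 4 * np.pi * dr
work /= r
work += -work[-1] - N / r[-1]

assert np.allclose(work, phi_ref)
```

This only reduces allocation pressure; it doesn't change the asymptotic cost, but for small arrays the per-operation overhead (dispatch plus allocation) is exactly the part that doesn't scale with array length.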

1 Answer


Maybe I'm too tired, but why do you need two loops? Both loops have the same number of iterations and you only need the sum up to that index...

!Personally I would define the precision used in the following way:
!integer, parameter :: singlep = selected_real_kind(6,37)   !single
integer, parameter :: doublep = selected_real_kind(15,307)  !double

!real(kind=doublep) :: sum2_j, pi=3.14159265359_doublep, N, dr, sum1_i, pi4dr

integer :: i, j, sizepsi
double precision :: sum2_j, pi = 3.14159265359, N, dr, sum1_i, pi4dr
double precision, dimension(:), allocatable :: sum1_array, phi, step1, r
complex(8), dimension(:), allocatable :: psi
!double precision :: start, finish
pi4dr = 4.0*dr*pi


sizepsi=size(psi)
allocate (phi(sizepsi))

!call cpu_time(start)

sum1_i = 0.0 !you should add the precision here, like 0.0_doublep
sum2_j=0.0

do i=1,sizepsi !we already know how large psi is
    sum1_i=sum1_i+r(i)*abs(psi(i))**2
    sum2_j=sum2_j + sum1_i*dr
    phi(i)=pi4dr*sum2_j/r(i)
end do

phi=phi - phi(sizepsi)-N/r(sizepsi) !size(phi)=size(r)=size(psi)

Since your sample code isn't runnable, I'm not going to test it and compare the results. Edit: changed the inner loop to a slightly faster version.
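As a cross-check of the loop fusion itself (not the timing), here is a small sketch, in Python for brevity and with synthetic data, confirming that the single fused loop produces the same phi as the original two-pass version:

```python
import numpy as np

# Hypothetical inputs; the real psi/r from the question aren't available.
n, dr, N = 500, 0.01, 1.0
r = dr * np.arange(1, n + 1)
psi = np.exp(-r) * (1 - 0.25j)

# Two-pass version from the question, written with cumsum.
s1 = np.cumsum(r * np.abs(psi) ** 2) * dr
phi_two = 4 * np.pi * np.cumsum(s1) * dr / r
phi_two += -phi_two[-1] - N / r[-1]

# Single fused loop, as in this answer.
pi4dr = 4.0 * np.pi * dr
phi_one = np.empty(n)
sum1 = 0.0
sum2 = 0.0
for i in range(n):
    sum1 += r[i] * abs(psi[i]) ** 2   # running first prefix sum
    sum2 += sum1 * dr                 # running second prefix sum
    phi_one[i] = pi4dr * sum2 / r[i]
phi_one += -phi_one[-1] - N / r[-1]

assert np.allclose(phi_one, phi_two)
```

The fusion is valid because element j of the second prefix sum only needs the first prefix sum up to j, which is exactly the running value `sum1` at that iteration.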

Tine198
  • Thank you, but to my surprise I think it takes even more time with your construction, or it is at least not faster. The average of 10 measurements says: with your code (dr=0.01) 5e-5 s, and with my code 4e-5 s. – fortran1 Jan 22 '21 at 11:38
  • As others already said, without a proper MRE (and the compiler options used), testing and giving useful answers is hard. That being said, I was bored, so I added code to test the timing (compiled with gfortran -O2). The performance depends on the size of psi: for small vectors (<1e4 elements) there is no real difference, and usually your original version is slightly faster (e.g. 1.5e-8 s averaged over 1e7 iterations for a vector size of 100); for huge vectors the situation is different and my (updated) version is faster (e.g. 0.2 s averaged over 100 runs for a vector size of 1e8). – Tine198 Jan 22 '21 at 14:15
  • OK, thanks, but the vector size I want to calculate with is around 1e5, and I can see no difference in the running times. – fortran1 Jan 22 '21 at 21:43