Is there room to further optimize the stochastic_rk Fortran 90 code?

Question

I need to use a Fortran code to solve stochastic differential equation (SDE). I looked at the famous Fortran code website by Burkardt,

https://people.math.sc.edu/Burkardt/f_src/stochastic_rk/stochastic_rk.html

I particular looked at the rk4_ti_step subroutine in stochastic_rk.f90 code,

https://people.math.sc.edu/Burkardt/f_src/stochastic_rk/stochastic_rk.f90

My optimized version is below,

subroutine rk4_ti_step_mod ( x, t, h, q, fi, gi, seed, xstar )
use random  
implicit none
real ( kind = 8 ), external :: fi
real ( kind = 8 ), external :: gi
real ( kind = 8 ) h
real ( kind = 8 ) k1
real ( kind = 8 ) k2
real ( kind = 8 ) k3
real ( kind = 8 ) k4
real ( kind = 8 ) q
real ( kind = 8 ) r8_normal_01
integer ( kind = 4 ) seed
real ( kind = 8 ) t
real ( kind = 8 ) t1
real ( kind = 8 ) t2
real ( kind = 8 ) t3
real ( kind = 8 ) t4
real ( kind = 8 ) w1
real ( kind = 8 ) w2
real ( kind = 8 ) w3
real ( kind = 8 ) w4
real ( kind = 8 ) x
real ( kind = 8 ) x1
real ( kind = 8 ) x2
real ( kind = 8 ) x3
real ( kind = 8 ) x4
real ( kind = 8 ) xstar   
real ( kind = 8 ) :: qoh
real ( kind = 8 ) :: normal(4)
real ( kind = 8 ), parameter :: a21 = 2.71644396264860D+00 &
,a31 = - 6.95653259006152D+00 &
,a32 =   0.78313689457981D+00 &
,a41 =   0.0D+00 &
,a42 =   0.48257353309214D+00 &
,a43 =   0.26171080165848D+00 &
,a51 =   0.47012396888046D+00 &
,a52 =   0.36597075368373D+00 &
,a53 =   0.08906615686702D+00 &
,a54 =   0.07483912056879D+00 &
,q1 =   2.12709852335625D+00 &
,q2 =   2.73245878238737D+00 &
,q3 =  11.22760917474960D+00 &
,q4 =  13.36199560336697D+00
real ( kind = 8 ), parameter, dimension(4) :: qarray = [ 2.12709852335625D+00 &
    ,2.73245878238737D+00 &
    ,11.22760917474960D+00 &
    ,13.36199560336697D+00 ]
real ( kind = 8 ) :: warray(4)
integer (kind = 4) :: i
qoh = q / h
normal = gaussian(4) 
do i =1,4
    warray(i) = normal(i)*sqrt(qarray(i)*qoh)
enddo
t1 = t
x1 = x
k1 = h * ( fi ( x1 ) + gi ( x1 ) * warray(1) ) 
t2 = t1 + a21 * h
x2 = x1 + a21 * k1
k2 = h * ( fi ( x2 ) + gi ( x2 ) * warray(2) )
t3 = t1 + ( a31  + a32 )* h
x3 = x1 + a31 * k1 + a32 * k2
k3 = h * ( fi ( x3 ) + gi ( x3 ) * warray(3) )
t4 = t1 + ( a41  + a42 + a43 ) * h
x4 = x1 + a41 * k1 + a42 * k2
k4 = h * ( fi ( x4 ) + gi ( x4 ) * warray(4) )
xstar = x1 + a51 * k1 + a52 * k2 + a53 * k3 + a54 * k4
return
end

Note that I use my module of random number, and gaussian is my random number function, this part does not matter.

I just wonder,

Can anyone give some suggestions as to can the code be further optimized?
Does anyone know what is the best/fastest SDE Fortran subroutine? Or what algorithm is the best?

Thank you very much!

A few changes should be made, but none will likely improve performance. — steve, Sep 12 '21 at 05:10
To be honest there's not a lot that can be said WRT optimisation without seeing a whole code, or at the very least the functions referenced in the above routine. But I would note you could avoid a small number of sqrts by calculating Sqrt( qoh ) once, and storing the sqrt of qarray in a parameter (making reasonable assumptions about the routine gaussian() ). But there is so little work shown here I doubt it will change much. — Ian Bush, Sep 12 '21 at 07:37
Stylistically note `Real( 8 )` and `Integer( 4 )` are not portable, not guaranteed to be supported by your compiler, and may not do what you expect them to do. See https://stackoverflow.com/questions/838310/fortran-90-kind-parameter for better ways - personally I would recommend using iso_fortran_env noted in the comments. External subprograms are also not best practice nowadays, stick them in modules. — Ian Bush, Sep 12 '21 at 07:40
The first step of optimising your code should always be to profile your code, so you know what parts of the code are taking the most time. Without this information (or a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example)) there's not a lot we can say to help. — veryreverie, Sep 12 '21 at 08:08
Having said that, I wonder why you are using many scalar variables rather than vectors and matrices? If you roll e.g. `x1` to `x4` and `k1` to `k4` into vectors `x` and `k` (stored as arrays in Fortran) then you can write things like `k = h * (fi(x) + gi(x)*warray)`, which will let the compiler do things like vectorise your code. — veryreverie, Sep 12 '21 at 08:12
What is the tipical use-case of this function? Regarding its used the code could be vectorized to make it much faster (eg. with AVX-512 on x86 platforms, the maximum speed up is about 8 times faster). Alternatively, the compiler can use AVX on x86 platforms (up to 4 times faster) to vectorize the end of the code if the code is tuned to be SIMD-aware which is not currently the case here (because of the function calls and the repeated lines). The div can be replaced by an inverse multiplication too regarding the use-case and so on. Beside this there is no much to do... — Jérôme Richard, Sep 12 '21 at 09:50
@JérômeRichard Can you be specific how to vectorize this code? those x and k have some dependence and i do not see they can be vectorized? — CRquantum, Sep 12 '21 at 17:46
@veryreverie Could you please post your code? I do not see clearly how it can be vectorized. Those x and k have some dependence if you look at the code. — CRquantum, Sep 12 '21 at 17:47
@IanBush Thank you! I know. The code is written by Burcardt. For now I just try to make it faster in this piece of code. Then I will use select kind to replace those kind=8 stuff. — CRquantum, Sep 12 '21 at 17:49
@CRquantum Ha, you are right for the last lines: I did not see the dependance. My bad. Still, if the function is executed for N different values and `gi` can be vectorized as well as `gaussian`, then the whole function can be vectorized (if so, I think OpenMP simd directives should help). — Jérôme Richard, Sep 12 '21 at 20:42
@JérômeRichard Thank you very much! Yes I know a little bit about vectorization. But you know what? I checked, actually nearly in all cases, it seems, just say intel Fortran, the compiler is doing extremely well in optimization the do loop and automatically make the do loop vectorized. Many times my 'vectorized' version performs just the same as the simple do loop version. — CRquantum, Sep 12 '21 at 20:52
Check the solver routines on examples with known solutions, that is, the geometric Brownian motion. The second order routines performed very strangely there, see https://scicomp.stackexchange.com/questions/35253/time-independent-runge-kutta-integration-of-sde — Lutz Lehmann, Sep 17 '21 at 08:57
That is, apply the solver to `dX = a*X*dt + b*X*dW` and then make a histogram for `dX/X`. The values should form a bell curve around `a*dt` with quadratic variation `b^2*dt`. If that is not the case the method is not reliable. — Lutz Lehmann, Sep 17 '21 at 11:57
@LutzLehmann Thank you very much! May I ask, which SDE method is the best now? — CRquantum, Sep 20 '21 at 03:55
Everything that comes from Higham, Kloeden, Platen is well-tested. The julia-lang diffeq package also has higher order methods. Afaik it is rather expensive to get anything that has a strong order greater 1 and a weak order greater 2. Special voodoo is also demanded in anything above Euler-Maruyama if the underlying Brownian motion is multi-dimensional. — Lutz Lehmann, Sep 20 '21 at 05:52

veryreverie · Accepted Answer · 2021-09-16T22:58:48.133

The interdependence of x and c means you can't turn as much into linear algebra as I first thought, but I'd still expect some speedup by grouping everything into appropriate arrays as:

subroutine rk4_ti_step_mod ( x, t, h, q, fi, gi, seed, xstar )
  use random  
  implicit none
  
  integer, parameter :: dp = selected_real_kind(15,307)
  integer, parameter :: ip = selected_int_kind(9)
  
  real(dp), intent(in) :: x
  real(dp), intent(in) :: t  
  real(dp), intent(in) :: h
  real(dp), intent(in) :: q
  real(dp), external :: fi
  real(dp), external :: gi
  integer(ip), intent(in) :: seed
  real(dp), intent(out) :: xstar

  real(dp), parameter :: as(4,5) = reshape([ &
     &  0.0_dp,              0.0_dp,              0.0_dp,              0.0_dp, &
     &  2.71644396264860_dp, 0.0_dp,              0.0_dp,              0.0_dp, &
     & -6.95653259006152_dp, 0.78313689457981_dp, 0.0_dp,              0.0_dp, &
     &  0.0_dp,              0.48257353309214_dp, 0.26171080165848_dp, 0.0_dp, &
     &  0.47012396888046_dp, 0.36597075368373_dp, 0.08906615686702_dp, 0.07483912056879_dp &
     & ], [4,5])
  real(dp), parameter :: qs(4) = [ &
     &  2.12709852335625_dp, &
     &  2.73245878238737_dp, &
     & 11.22760917474960_dp, &
     & 13.36199560336697_dp ]

  real(dp) :: ks(4)
  real(dp) :: r8_normal_01
  real(dp) :: ts(4)
  real(dp) :: ws(4)
  real(dp) :: xs(4)
  real(dp) :: normal(4)
  real(dp) :: warray(4)
  
  normal = gaussian(4) 
  warray = normal*sqrt(qs)*sqrt(q/h)
  
  do i=1,4
    ts(i) = t + sum(as(:i-1,i)) * h
    xs(i) = x + dot_product(as(:i-1,i), ks(:i-1))
    ks(i) = h * (fi(xs(i)) + gi(xs(i))*warray(i))
  enddo
 
  xstar = x + dot_product(as(:,5), ks)
end subroutine

although it's difficult to tell without knowing anything about fi and gi.

Also note you don't seem to be using the t1 to t4 variables.

Thank you very much! The fi = -x, gi=1.0. Indeed, this code was initially written by John Burcardt. In the rk4_ti subroutine (ti means time invariant), t1 to t4 is not really used. I guess he wrote here is for illustration. But in the rk4_tv subroutine (tv means time variant), t1 to t4 are used. — CRquantum, Sep 12 '21 at 20:28
I checked, your version the speed is the same as my version. But yours is more concise. Thank you very much! — CRquantum, Sep 12 '21 at 20:35
May I ask a stupid question, I see your nice code have things like, ks(:i-1). However, when k=1, it looks like it will access ks(:0) which is out of the range. However it does not really happen. Therefore by doing ks(:i-1), is it automatically begin from ks(:i-1) where i-1 >= 1 ? — CRquantum, Sep 13 '21 at 04:10
`ks(:0)` is just an empty slice, equivalent to `[]`. If an optimising compiler unrolls the loop, it should be able to just drop the `sum([])` and `dot_product([],[])` lines. — veryreverie, Sep 13 '21 at 07:16

Is there room to further optimize the stochastic_rk Fortran 90 code?

1 Answers1