
I am trying to optimize the numerical solution of a stochastic differential equation. In practice this means I have to solve a set of Mmax independent differential equations. For that task I use Numba, but I have the feeling I am doing it wrong: even though I do not get any errors and I do see some time reduction, it is not as much as I expected.

My code looks like this:


import numpy as np
from numba import njit, prange

@njit(fastmath=True)
def Liouvillian(rho,t,alpha1,alpha1p):
    # rho0, L0, L1m and L1p are globals defined further below
    C = np.zeros(len(rho0),dtype=np.complex64)
    B = L0+alpha1p*L1m+alpha1*L1p
    for j in range(len(rho0)):
        for i in range(len(rho0)):
            C[j] += B[j, i]*rho[i]
    return C

@njit(parallel=True, fastmath=True)
def RK4(rho0,t_list,alpha1,alpha1p):
    rho=np.zeros((len(rho0),len(t_list),Mmax),dtype=np.complex64) 
    for m in prange(Mmax):
        rho[:,0,m]=rho0
        n=0
        for t in t_list[0:len(t_list)-1]:
            k1=Liouvillian(rho[:,n,m],t,alpha1[st*n,m],alpha1p[st*n,m])
            k2=Liouvillian(rho[:,n,m]+dt*k1/2,t+dt/2,alpha1[st*n,m],alpha1p[st*n,m])
            k3=Liouvillian(rho[:,n,m]+dt*k2/2,t+dt/2,alpha1[st*n,m],alpha1p[st*n,m])
            k4=Liouvillian(rho[:,n,m]+dt*k3,t+dt,alpha1[st*n,m],alpha1p[st*n,m])
            rho[:,n+1,m]=rho[:,n,m]+dt*(k1+2*k2+2*k3+k4)/6.0
            n=n+1
    return rho

where alpha1 and alpha1p are 2D arrays with elements indexed as [t,m], and L0, L1m, L1p are just constant arrays.

Specifically, taking Mmax=10000, without Numba it takes 16 s, while with Numba (both parallel and fastmath) it reduces to 14 s. I see some reduction, but this is not what I expected from the parallelization. Is there something wrong in my code, or is this what I should get?

Edit: here is the code I use to generate the random arrays that I later feed into the ODE solver.

import numpy as np
from scipy.linalg import sqrtm  # needed by b_part below

def A_matrix(num_modes,kappa,E_0):
    A_modes=np.zeros(((2*num_modes),(2*num_modes)),dtype=np.complex64)
    for i in range(np.shape(A_modes)[0]):
        for j in range(np.shape(A_modes)[0]):
            if i==j:
                A_modes[i,j]=-kappa/4
                if i==0:
                    A_modes[i,i+num_modes+1]=E_0*kappa/2  
                    A_modes[i,-1]=E_0*kappa/2 
                elif 0<i<num_modes-1:
                    A_modes[i,i+num_modes+1]=E_0*kappa/2
                    A_modes[i,i+num_modes-1]=E_0*kappa/2 
                elif i==num_modes:
                    A_modes[i-1,i]=E_0*kappa/2 
                    A_modes[i-1,-2]=E_0*kappa/2 
    return (A_modes+A_modes.transpose())


def D_mat(num_modes):
    D_matrix=np.zeros(((num_modes),(num_modes)),dtype=np.complex64)
    for i in range(np.shape(D_matrix)[0]):
        for j in range(np.shape(D_matrix)[0]):
            if i==j:
                if i==0:
                    D_matrix[i,i+1]=1
                    D_matrix[i,-1]=1
                elif 0<i<num_modes-1:
                    D_matrix[i,i+1]=1
    return (D_matrix+D_matrix.transpose())

def a_part(alpha,t_list):
    M=A_matrix(num_modes,kappa,E_0)
    return M.dot(alpha)


def b_part(w,t_list):
    D_1=D_mat(num_modes)
    Zero_modes=np.zeros(((num_modes),(num_modes)),dtype=np.complex64)
    D=np.bmat([[D_1, Zero_modes], [Zero_modes, D_1]])
    B=sqrtm(D)*np.sqrt(E_0*kappa/2)
    return B.dot(w)

def SDE_Param_Euler_Mauyrama(Mmax):
    alpha=np.zeros((2*num_modes,Smax+1,Mmax),dtype=np.complex64)
    n=0
    alpha[:,n,:]=0.0+1j*0.0
    for s in s_list[0:len(s_list)-1]:
        alpha[:,n+1,:]=alpha[:,n,:]+ds*a_part(alpha[:,n,:],s)+b_part(w[:,n,:],s)
        n=n+1
    return (alpha)

#Parameters
E_0=0.5
kappa=10.
gamma=1.
num_modes=2 ## number of modes

Mmax=10000 #number of samples

Tmax=20 ##max value for time
dt=1/(2*kappa)

st=10

ds=dt/st
Nmax=int(Tmax/dt) ##number of steps
Smax=int(Tmax/ds) ##number of steps

t_list=np.arange(0,Tmax+dt/2,dt)
s_list=np.arange(0,Tmax+ds/2,ds)


w = np.random.randn(2*num_modes,Smax+1,Mmax)*np.sqrt(ds)

(alpha1,alpha2,alpha1p,alpha2p)=SDE_Param_Euler_Mauyrama(Mmax)

from qutip import *
Delta_a=0

num_qubits=1
##Atom 1
sm_1=sigmam()
sp_1=sigmap()
sx_1=sigmax()
sy_1=sigmay()
sz_1=sigmaz()


state0=np.zeros(2**num_qubits,dtype=np.complex64)
state0[-1]=1

rho0=np.kron(state0,np.conjugate(np.transpose(state0)))

Ha=Delta_a*(sz_1)/2    #Deterministic Liouvillian
L_ops=[np.sqrt(gamma)*sm_1]

L0=np.array(liouvillian(Ha,L_ops),dtype=np.complex64)

L1m=-np.array(np.sqrt(kappa*gamma)*1j*liouvillian(sm_1),dtype=np.complex64)
L1p=np.array(np.sqrt(kappa*gamma)*1j*liouvillian(sp_1),dtype=np.complex64)

rho_1=RK4(rho0,t_list,alpha1,alpha1p).reshape(2**num_qubits,2**num_qubits,len(t_list),Mmax)


J.Agusti
  • Are you sure you are solving an SDE here? I see nothing involving randomness. Also, if you have to solve many initial conditions in parallel, there are usually other ways to speed things up, depending on why you want to do this. – Wrzlprmft Apr 27 '23 at 05:21
  • @Wrzlprmft I edited my post showing how I get the random elements. – J.Agusti Apr 27 '23 at 10:11

1 Answer


I expect the main bottleneck to be temporary arrays, especially since the code is parallel and the amount of computation is small compared to the number and size of the temporary arrays (bigger arrays are slower to allocate). Thus, the key is to avoid temporary arrays by allocating them once and reusing them. You can also use plain loops to avoid creating temporary arrays implicitly, as NumPy expressions do. Allocating a temporary array is an expensive operation, and filling a newly allocated array is slow because of page faults. For example, k1+2*k2+2*k3+k4 creates many temporary arrays that are not needed. You can write a loop that directly writes into rho based on k1, k2, k3 and k4. You could even avoid creating those four arrays in the first place by accumulating values into rho. Memory operations tend to be slower than computations on modern hardware, so doing operations in place and avoiding copies and temporary arrays is often a good idea for performance, especially in parallel code, since the L3 cache and the DRAM are shared between cores.
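For instance, a minimal sketch of such a loop (using the question's original layout, and assuming k1 to k4 have already been computed as in the question) is:

# Sketch: combine the RK4 stages item by item so that no temporary
# arrays are allocated; the constant dt/6 is hoisted out of the loop.
coef = dt * (1.0/6.0)
for i in range(len(rho0)):
    rho[i, n+1, m] = rho[i, n, m] + coef*(k1[i] + 2*k2[i] + 2*k3[i] + k4[i])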

Besides, there is another issue: the memory access pattern is inefficient. Indeed, rho[:,n,m] accesses items with a stride of len(t_list)*Mmax elements (8 bytes each for complex64). Strided memory accesses are far less efficient than contiguous ones because the hardware is optimized for the latter. One solution to avoid repeated strided accesses is to physically transpose the input/output arrays (note that arr.T does not physically transpose an array but returns a non-contiguous view of it, while arr.T.copy() does return a contiguous, physically transposed array, though it is not implemented very efficiently yet). In your case, the best option is certainly to change the layout of rho. The transposition trick is only worth it if other algorithms operating on the array before/after this one also need to access it contiguously; you can transpose the array once between algorithms rather than doing it twice here (and possibly in other algorithms).

Here is an example for the second point:

@njit(parallel=True, fastmath=True)
def RK4(rho0,t_list,alpha1,alpha1p):
    rho=np.zeros((len(t_list), Mmax, len(rho0)),dtype=np.complex64)
    for m in prange(Mmax):
        rho[0,m,:]=rho0
        n=0
        for t in t_list[0:len(t_list)-1]:
            k1=Liouvillian(rho[n,m,:],t,alpha1[st*n,m],alpha1p[st*n,m])
            k2=Liouvillian(rho[n,m,:]+dt*k1/2,t+dt/2,alpha1[st*n,m],alpha1p[st*n,m])
            k3=Liouvillian(rho[n,m,:]+dt*k2/2,t+dt/2,alpha1[st*n,m],alpha1p[st*n,m])
            k4=Liouvillian(rho[n,m,:]+dt*k3,t+dt,alpha1[st*n,m],alpha1p[st*n,m])
            rho[n+1,m,:]=rho[n,m,:]+dt*(k1+2*k2+2*k3+k4)/6.0
            n=n+1
    return rho

The result is transposed (and so laid out differently from the initial function's output). The computation is about twice as fast on my machine thanks to the transposition. Applying the first point should result in an even more significant speed-up.

Jérôme Richard
  • Thank you for your detailed answer, much appreciated. While I understand the first point, I do not seem to understand your second idea. Could you please provide a toy model of what you mean? Note that the first axis of rho[:,n,m], as I show now in my edited post, is small here, but in general it will have length 1048576, which is quite demanding. – J.Agusti Apr 27 '23 at 10:21
  • I provided an example of code for the second point. Doing the same for the `n`-axis (i.e. now the first one) may speed this up even more. – Jérôme Richard Apr 28 '23 at 16:26
  • Ah, I was not aware of this strided-memory problem! Thanks! However, when implementing your answer on my laptop, I actually get the same computational time as before. Can I ask what matrix size you tested with? – J.Agusti Apr 29 '23 at 10:51
  • Strange. I just copy-pasted your code to get the input matrix, with no changes. The bottleneck can change from one machine to another, but I expected the performance to be at least slightly better. Anyway, the first point is probably the key one for performance. – Jérôme Richard Apr 29 '23 at 18:06
  • Then I do have one question: I tried to implement your first point and actually got slower performance... Let me see if I understood it: the idea is to avoid creating the k_i variables. So instead of doing rho[:,n+1,m]=rho[:,n,m]+dt*(k1+2*k2+2*k3+k4)/6.0, I should do rho[:,n+1,m]=rho[:,n,m]+dt/6*(Liouvillian(rho[:,n,m],t,alpha1[st*n,m],alpha1p[st*n,m])+2*Liouvillian(rho[n,m,:]+dt*Liouvillian(rho[:,n,m],t,alpha1[st*n,m],alpha1p[st*n,m])/2,t+dt/2,alpha1[st*n,m],alpha1p[st*n,m]))+etc., where we use that k_2 depends on k_1, etc. – J.Agusti Apr 29 '23 at 18:20
  • The code is barely readable in comments, but note that assigning the result to a variable does *not* create a temporary array. Setting a variable is very cheap because a variable only references an existing object. When you do `v = a * b + c`, `a * b` is first computed, producing an unnamed temporary object, which is then used to compute `temporary_object + c`, resulting in another (possibly temporary) object; `v` then references this last object. The same number of temporary objects is created whether or not the result is assigned to `v`. – Jérôme Richard May 01 '23 at 15:32
  • To avoid creating temporary arrays, you need to use *plain loops* in Numba. The loop can iterate over each item of each array and directly write each resulting item into the output array. For example, something like `for i in range(len(rho0)): rho[n+1,m,i] = rho[n,m,i]+dt*(k1[i]+2*k2[i]+2*k3[i]+k4[i])/6.0`. By the way, multiplying a value by `(1.0/6.0)` might be faster than dividing it by `6.0`, because the former constant can be precomputed at compile time and multiplications tend to be faster than divisions. A sketch combining both points is given below. – Jérôme Richard May 01 '23 at 15:34
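Putting these comments together with the transposed layout from the answer, here is a minimal sketch of a fully loop-based version (the names apply_liouvillian and RK4_inplace are illustrative; it assumes the same globals dt, st, Mmax, L0, L1m and L1p as in the question, and is a sketch rather than a drop-in replacement):

# Sketch combining both points: transposed layout (contiguous along the
# state axis) and plain loops with buffers allocated once per thread, so
# no temporary arrays are created inside the time loop.
@njit(fastmath=True)
def apply_liouvillian(out, rho_in, alpha1, alpha1p):
    # out <- (L0 + alpha1p*L1m + alpha1*L1p) @ rho_in, computed in place
    # without materializing the matrix B
    for j in range(out.shape[0]):
        acc = 0j
        for i in range(out.shape[0]):
            acc += (L0[j, i] + alpha1p*L1m[j, i] + alpha1*L1p[j, i]) * rho_in[i]
        out[j] = acc

@njit(parallel=True, fastmath=True)
def RK4_inplace(rho0, t_list, alpha1, alpha1p):
    n_dim = len(rho0)
    rho = np.zeros((len(t_list), Mmax, n_dim), dtype=np.complex64)
    coef = dt * (1.0/6.0)  # precomputed constant, as suggested above
    for m in prange(Mmax):
        # per-thread scratch buffers, allocated once and reused at every step
        k1 = np.empty(n_dim, dtype=np.complex64)
        k2 = np.empty(n_dim, dtype=np.complex64)
        k3 = np.empty(n_dim, dtype=np.complex64)
        k4 = np.empty(n_dim, dtype=np.complex64)
        tmp = np.empty(n_dim, dtype=np.complex64)
        rho[0, m, :] = rho0
        for n in range(len(t_list) - 1):
            a1, a1p = alpha1[st*n, m], alpha1p[st*n, m]
            apply_liouvillian(k1, rho[n, m, :], a1, a1p)
            for i in range(n_dim):
                tmp[i] = rho[n, m, i] + 0.5*dt*k1[i]
            apply_liouvillian(k2, tmp, a1, a1p)
            for i in range(n_dim):
                tmp[i] = rho[n, m, i] + 0.5*dt*k2[i]
            apply_liouvillian(k3, tmp, a1, a1p)
            for i in range(n_dim):
                tmp[i] = rho[n, m, i] + dt*k3[i]
            apply_liouvillian(k4, tmp, a1, a1p)
            for i in range(n_dim):
                rho[n+1, m, i] = rho[n, m, i] + coef*(k1[i] + 2*k2[i] + 2*k3[i] + k4[i])
    return rho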