
I am trying to integrate orbits of stars inside a potential using a Velocity-Verlet integrator. However, my current algorithm takes around 25 seconds per orbit, and I aim to integrate around 1000 orbits, which would take around 7 hours. I therefore hope someone can help me optimize the algorithm:

The Velocity-Verlet integrator is defined as (see verlet_integration() for an implementation):

x_{n+1} = x_n + v_n * dt + 0.5 * a_n * dt^2
v_{n+1} = v_n + 0.5 * (a_n + a_{n+1}) * dt

I want to integrate orbits using a Simple Logarithmic Potential (SLP) defined as:

Phi(x, y) = 0.5 * v_0^2 * ln(x^2 + (y / q)^2)

The value v_0 is always equal to 1 and the value q can either be 0.7 or 0.9 (see SLP()).

Using this potential, the acceleration that the integrator needs can be calculated (see apply_forces()).
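For reference, the acceleration is minus the gradient of the potential; with v_0 = 1 this gives exactly what apply_forces() computes:

a_x = -x / (x^2 + y^2 / q^2)
a_y = -y / (q^2 * x^2 + y^2)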

After this I choose x = 0 as the starting value for all orbits, and the energy E = 0 as well. From these, the starting value of vx can be calculated (see calc_vx()).
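Concretely, solving E = 0.5 * (vx^2 + vy^2) + Phi(0, y) = 0 for vx (again with v_0 = 1) gives

vx = sqrt(-ln((y / q)^2) - vy^2)

which is exactly what calc_vx() evaluates; the argument of the square root has to be non-negative for a valid starting point.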

To have orbits that are accurate enough, the time step needs to be 1E-4 or smaller. I need to analyze these orbits later, and for that they need to be sufficiently long, so I integrate from t = 0 to t = 200 (i.e. two million steps per orbit).

I need to calculate values over the entire allowed phase space of (y, vy), i.e. the region where calc_vx() does not take the square root of a negative number. For that, a large number of orbits needs to be integrated: I hope to be able to calculate at least 1000 values, and 10,000 would definitely be better, although maybe that's asking for too much. If you have any ideas for performance improvements, please let me know. Using another language is not an option, so please don't suggest that.
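For illustration, the scan over starting points looks roughly like the sketch below, using the functions defined further down; the grid bounds and resolution are arbitrary placeholders, not fixed choices:

q = 0.7
ys = np.linspace(0.05, 1.0, 40)      # y > 0 so the logarithm stays finite
vys = np.linspace(-1.0, 1.0, 40)

orbits = []
for y0 in ys:
    for vy0 in vys:
        # allowed region: the argument of the square root in calc_vx() is non-negative
        if -np.log((y0 / q) ** 2) - vy0 ** 2 >= 0:
            orbits.append(integration(y0, vy0, 1E-4, 0, 200, q))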

An example of what these orbits look like: (orbit plot omitted)

Everything necessary to run the code can be found below, including example starting values.

UPDATE: I have implemented the suggestions by mivkov; this reduced the time to 9 seconds (3 seconds faster), but it's still quite slow. Any other suggestions are still welcome.

import numpy as np


def calc_vx(y, vy, q):
    """
    Calculate starting value of x velocity
    """
    vx2 = -np.log((y / q) ** 2) - vy ** 2
    return np.sqrt(vx2)


def apply_forces(x, y, q):
    """
    Apply forces to determine the accelerations
    """
    Fx = -x / (y ** 2 / q ** 2 + x ** 2)
    Fy = -y / (q ** 2 * x ** 2 + y ** 2)
    return Fx, Fy


def verlet_integration(start, dt, steps, q):
    # initialise an array and set the first value to the starting value
    vals = np.zeros((steps, *np.shape(start)))
    vals[0] = start

    # run through all elements and apply the integrator to each value
    for i in range(steps - 1):
        x_vec, v_vec, a_vec = vals[i]
        new_x_vec = x_vec + dt * (v_vec + 0.5 * a_vec * dt)
        new_a_vec = apply_forces(*new_x_vec, q)
        new_v_vec = v_vec + 0.5 * (a_vec + new_a_vec) * dt
        vals[i + 1] = new_x_vec, new_v_vec, new_a_vec

    # I return vals.T so I can use the following to get arrays for the position, velocity and acceleration
    # ((x, vx, ax), (y, vy, ay)) = verlet_integration( ... )
    return vals.T


def integration(y, vy, dt, t0, t1, q):
    # calculate the starting values
    vx = calc_vx(y, vy, q)
    ax, ay = apply_forces(0, y, q)
    start = [(0, y), (vx, vy), (ax, ay)]
    steps = round((t1 - t0) / dt)  # calculate the number of steps needed

    e = verlet_integration(start, dt, steps, q)  # integrate
    return e

((x, vx, ax), (y, vy, ay)) = integration(0.1, 0.2, 1E-4, 0, 100, 0.7)
Viktor VN
  • Oh boy, there's a lot to unpack here. Your code looks nice and clean, but first and foremost, if you want performance, Python is not the way to go. Have you considered using an HPC language like C or Fortran? You could perform the calculations with it, write a formatted file with the results, and then use Python to plot it quickly. – mivkov Jun 02 '22 at 15:40
  • Aside from that, there are several things you could do. For example, remove all the "unnecessary" things that you do. While they make your code look nice, they use unnecessary resources. For example: a) don't define functions within functions. b) `apply_forces` returns a list. Why? Just return two values. c) `func_verlet` returns an array (does it need to be an array? Can't you just return 3 values?) and on top of that, you first define the array, then make numpy transpose it. Just define the array as transposed already, if you really need one. – mivkov Jun 02 '22 at 15:46
  • Unfortunately, this is an assignment for an astronomy class. We need to use python inside jupyter notebooks. Another language is not allowed. I will try to add the changes you suggested, thank you – Viktor VN Jun 02 '22 at 15:48
  • The way to go (and gather experience) with these kinds of issues is to use timers and time how much time you spend in which functions, then see where you can most likely save some time and where it would be most efficient to invest effort into optimizing. – mivkov Jun 02 '22 at 15:50
  • Finally, if you need more speed, there are some other tricks. For example, numpy is great at vectorized operations, but given that you only integrate one point over time, that won't help you much. Another option is to look into the numba library. There are options to "pre-compile" your functions when they are used, which may improve your performance significantly. – mivkov Jun 02 '22 at 15:50
  • I got the error `verlet_integration() takes 4 positional arguments but 5 were given` while running your code. The number of parameters indeed does not match. I fixed this locally, but then I got other errors like `func_verlet() missing 1 required positional argument: 'q'`, certainly because the number of parameters does not match when calling `f`. Can you fix the errors so the code can run properly? – Jérôme Richard Jun 02 '22 at 16:24
  • Yes, you are correct, I'm sorry, it should be fixed now. – Viktor VN Jun 02 '22 at 16:30

3 Answers


I see two things which can help here.

(1) Consider using an ODE solver from a library instead of the Verlet method you wrote by hand. Runge-Kutta methods and variations on them (e.g. Runge-Kutta-Fehlberg) are widely applicable; I'd try that first. A library method is likely written in C code and therefore much faster, and someone has already worked out the bugs. Both aspects will be useful for this problem.
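To make (1) concrete, here is a minimal sketch of what handing the equations of motion to a library solver could look like. I'm assuming SciPy's solve_ivp here, and the tolerances are placeholder values rather than a recommendation:

import numpy as np
from scipy.integrate import solve_ivp

def rhs(t, state, q):
    # state = [x, y, vx, vy]; the accelerations follow from the logarithmic potential
    x, y, vx, vy = state
    ax = -x / (y ** 2 / q ** 2 + x ** 2)
    ay = -y / (q ** 2 * x ** 2 + y ** 2)
    return [vx, vy, ax, ay]

q = 0.7
y0, vy0 = 0.1, 0.2
vx0 = np.sqrt(-np.log((y0 / q) ** 2) - vy0 ** 2)   # same E = 0 condition as calc_vx()
sol = solve_ivp(rhs, (0, 200), [0.0, y0, vx0, vy0],
                args=(q,), method='RK45', rtol=1e-9, atol=1e-9)
x, y, vx, vy = sol.y

One caveat: a Runge-Kutta scheme is not symplectic like Velocity-Verlet, so for long integrations it is worth monitoring the energy drift.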

(2) If you are required to write the ODE solver yourself, an idea to speed up the implementation in Python is to vectorize over trajectories so that you can make use of Numpy array operations. That is, create an array which represents all 1000 or 10,000 trajectories, and then advance all trajectories one step at a time. I am assuming you can rewrite the equations of motion in matrix form.
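A rough sketch of idea (2), keeping the fixed-step Verlet update but advancing N orbits at once with array operations (the batched (N, 2) layout below is my assumption, not something from the question):

import numpy as np

def verlet_step_batch(pos, vel, acc, dt, q):
    # pos, vel, acc all have shape (N, 2): one row per orbit
    new_pos = pos + dt * (vel + 0.5 * acc * dt)
    x, y = new_pos[:, 0], new_pos[:, 1]
    new_acc = np.stack((-x / (y ** 2 / q ** 2 + x ** 2),
                        -y / (q ** 2 * x ** 2 + y ** 2)), axis=1)
    new_vel = vel + 0.5 * (acc + new_acc) * dt
    return new_pos, new_vel, new_acc

# main loop: one time step for all orbits at once
# for i in range(steps - 1):
#     pos, vel, acc = verlet_step_batch(pos, vel, acc, dt, q)

As the comments below point out, storing every step for many orbits quickly exhausts memory, so it helps to keep only every k-th step or to process the orbits in chunks.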

As an aside, my advice is to keep the solution you have already worked out, since it seems to be working, and start on a separate implementation using ideas (1) or (2), and use your original implementation to verify the results.

Robert Dodier
  • (2) only applies for methods with fixed time steps or for collections of solutions that are guaranteed to remain close over the whole integration interval. – Lutz Lehmann Jun 02 '22 at 18:23
  • Hi, thank you for your comment. The first is not an option; the ODE solver needs to be written by ourselves. After some more improvements I got the time for 1 orbit down to 3.5 seconds. I also vectorized the function, which worked. But I cannot do vectorized operations on, say, 1000 orbits because my RAM gets filled instantly: 1000 times 6x20 million 64-bit floats is something like 16 gigabytes, I think? Anyway, I see my RAM being filled and my computer freezes. How could you get around that? Just split them up into batches, or are there better ways? – Viktor VN Jun 02 '22 at 19:12
  • Viktor, okay, you have to write your own solver, understood. About vectorizing over orbits: if 1000 is too many, try 100, or even 10. Also, try reducing the final time from 200 seconds to 2 seconds or something until you get the code working. You could set it up so that the final values of the orbits become the initial conditions for a new solution over t = (previous t final) to t = (previous t final + a small number of seconds), then just store chunks of orbits until you get all the chunks processed. – Robert Dodier Jun 02 '22 at 20:30
  • @LutzLehmann well, sure, but OP's equations are a fixed time step ODE solver. I guess I could say that explicitly, sorry, I was lazy. – Robert Dodier Jun 02 '22 at 20:32
  • @ViktorVN : Even if the required time step itself is small, you should check what time steps are actually needed in the ensuing statistical evaluation. Then you can save values only at a correspondingly coarser time resolution, combining 10 or 100 integration steps into one saved state. This should help with the memory for simultaneous trajectories. – Lutz Lehmann Jun 16 '22 at 08:15

The first thing to do is to use a profiler to analyse the hotspots in your code. You can use cProfile.run to do that. A basic report shows that nearly all the time is spent in the functions x_func, v_func and apply_forces (called 200_000 times):

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   200000    0.496    0.000    0.496    0.000 apply_forces
   199999    0.279    0.000    2.833    0.000 func_verlet
   199999    0.713    0.000    0.713    0.000 x_func
   199999    0.780    0.000    0.780    0.000 v_func
        1    0.000    0.000    0.000    0.000 SLP
   199999    0.222    0.000    0.718    0.000 a_func
        1    0.446    0.446    3.279    3.279 verlet_integration
[...]

A quick analysis of x_func and v_func shows that you are calling Numpy functions on tiny arrays. The thing is, Numpy is not optimized to deal with such small arrays (Numpy performs many checks on the input that are expensive compared to the computation time on tiny arrays).

The main way to address this problem is to use vectorized Numpy calls operating on much bigger arrays. The catch is that this solution requires completely redesigning your code so as to get rid of the for i in range(steps - 1) loop, which is the root of the problem.

An alternative solution is to use Numba to avoid the Numpy overhead. This solution is much simpler (though your teacher may not expect it if the goal is to learn Numpy). Here is an example:

import numpy as np
import numba as nb


@nb.njit
def SLP(x, y, v0, q):
    return 0.5 * v0 ** 2 * np.log(x ** 2 + (y / q) ** 2)


@nb.njit
def calc_vx(x, y, vy, q):
    """
    Calculate starting value of x velocity
    """
    vx2 = -2 * SLP(x, y, 1, q) - vy ** 2
    return np.sqrt(vx2)


@nb.njit
def apply_forces(x, y, v0, q):
    """
    Apply forces to determine the accelerations
    """
    Fx = -(v0 ** 2 * x) / (y ** 2 / q ** 2 + x ** 2)
    Fy = -(v0 ** 2 * y) / (q ** 2 * x ** 2 + y ** 2)
    return np.array([Fx, Fy])


@nb.njit
def x_func(x_vec, v_vec, a_vec, dt):
        return x_vec + dt * (v_vec + 0.5 * a_vec * dt)

@nb.njit
def v_func(v_vec, a_vec, new_a_vec, dt):
    return v_vec + 0.5 * (a_vec + new_a_vec) * dt

@nb.njit
def a_func(x_vec, dt, q):
    x, y = x_vec
    return apply_forces(x, y, 1, q)


# The parameter is a signature of the function that provides the input type to 
# Numba so it can eagerly compile the function and all the dependent functions.
# Please read the Numba documentation for more information about this.
@nb.njit('(float64[:], float64[:], float64[:], float64, float64)')
def func_verlet(x_vec, v_vec, a_vec, dt, q):
    # calculate the new position, velocity and acceleration
    new_x_vec = x_func(x_vec, v_vec, a_vec, dt)
    new_a_vec = a_func(new_x_vec, dt, q)
    new_v_vec = v_func(v_vec, a_vec, new_a_vec, dt)
    out = np.empty((len(new_x_vec), 3))
    out[:,0] = new_x_vec
    out[:,1] = new_v_vec
    out[:,2] = new_a_vec
    return out


def verlet_integration(start, f, dt, steps, q):
    # initialise an array and set the first value to the starting value
    vals = np.zeros((steps, *np.shape(start)))
    vals[0] = start

    # run through all elements and apply the integrator to each value
    for i in range(steps - 1):
        vals[i + 1] = f(*vals[i].T, dt, q)

    # I return vals.T so I can use the following to get arrays for the position, velocity and acceleration
    # ((x, y), (vx, vy), (ax, ay)) = verlet_integration( ... )
    return vals.T


def integration(y, vy, dt, t0, t1, q):
    # calculate the starting values
    x = 0
    vx = calc_vx(x, y, vy, q)
    ax, ay = apply_forces(x, y, 1, q)
    start = [(x, vx, ax), (y, vy, ay)]
    steps = round((t1 - t0) / dt)  # calculate the number of necessary steps

    return verlet_integration(start, func_verlet, dt, steps, q)


((x_, y_), (vx_, vy_), (ax_, ay_)) = integration(0.1, 0.2, 1E-4, 0, 20, 0.7)

Note that some functions have been modified to avoid using functions/operators not supported by Numba. For example, the unary * operator to unpack a Numpy array or list is unsupported, but it is generally quite inefficient anyway.

The resulting code is 5 times faster.

The profiler now shows that the function verlet_integration is responsible for a major part of the execution time due to the overhead of the for loop (including the * operator and the function calls). This part cannot easily be ported to Numba due to the lambda. I think it can be made twice as fast if you manage to redesign this part to avoid the lambda and the unpacking. In fact, operating on arrays of 2 items is pretty inefficient, even with Numba. Operating on scalars will make the code a bit less readable but much faster (certainly both with and without Numba). I guess the code can be made several times faster again.


UPDATE: with the updated code, Numba can help much more since the main performance bottleneck is now fixed. Here is the new Numba version:

import numpy as np
import numba as nb


@nb.njit
def calc_vx(y, vy, q):
    vx2 = -np.log((y / q) ** 2) - vy ** 2
    return np.sqrt(vx2)


@nb.njit
def apply_forces(x, y, q):
    Fx = -x / (y ** 2 / q ** 2 + x ** 2)
    Fy = -y / (q ** 2 * x ** 2 + y ** 2)
    return np.array([Fx, Fy])


@nb.njit('(float64[:,:], float64, int_, float64)')
def verlet_integration(start, dt, steps, q):
    vals = np.zeros((steps, 3, 2))
    vals[0] = start

    for i in range(steps - 1):
        x_vec, v_vec, a_vec = vals[i]
        new_x_vec = x_vec + dt * (v_vec + 0.5 * a_vec * dt)
        x, y = new_x_vec
        new_a_vec = apply_forces(x, y, q)
        new_v_vec = v_vec + 0.5 * (a_vec + new_a_vec) * dt
        vals[i + 1, 0] = new_x_vec
        vals[i + 1, 1] = new_v_vec
        vals[i + 1, 2] = new_a_vec

    return vals.T


def integration(y, vy, dt, t0, t1, q):
    vx = calc_vx(y, vy, q)
    ax, ay = apply_forces(0, y, q)
    start = [(0, y), (vx, vy), (ax, ay)]
    steps = round((t1 - t0) / dt)

    e = verlet_integration(np.array(start), dt, steps, q)
    return e


((x, vx, ax), (y, vy, ay)) = integration(0.1, 0.2, 1E-4, 0, 100, 0.7)  # 9.7

This is 36 times faster than the updated code in the question. It takes just 0.27 seconds on my machine, as opposed to 9.7 seconds for the initial code.

Jérôme Richard
  • I had a very similar solution to yours, except I "inlined" *apply_forces*. Can you reproduce my benchmark results on your machine (~2x speed up on google colab)? – Michael Szczesny Jun 02 '22 at 18:23
  • @MichaelSzczesny Yeah I can reproduce it and it is indeed twice as fast. See the related comment in your question for more information :) . – Jérôme Richard Jun 02 '22 at 18:29
  • How would you vectorize the for loop? I thought about that, but you need the previous element of the array to get the next, so I fail to see how you could do that with a vectorized operation. – Viktor VN Jun 02 '22 at 20:01

Inlining apply_forces speeds up @JérômeRichard's solution by ~2x in my benchmarks.

import numpy as np
import numba as nb   #tested with numba 0.55.2

@nb.njit
def calc_vx(y, vy, q):
    vx2 = -np.log((y / q) ** 2) - vy ** 2
    return np.sqrt(vx2)

@nb.njit
def verlet_integration(start, dt, steps, q):
    vals = np.zeros((steps, 3, 2))
    vals[0] = start

    for i in range(steps - 1):
        x_vec, v_vec, a_vec = vals[i]
        x, y = x_vec + dt * (v_vec + 0.5 * a_vec * dt)
        ax, ay =  -x / (y ** 2 / q ** 2 + x ** 2), -y / (q ** 2 * x ** 2 + y ** 2)
        vx, vy = v_vec[0] + 0.5 * (a_vec[0] + ax) * dt, v_vec[1] + 0.5 * (a_vec[1] + ay) * dt

        vals[i + 1, 0] = x, y
        vals[i + 1, 1] = vx, vy
        vals[i + 1, 2] = ax, ay
    return vals.T

@nb.njit
def integration(y, vy, dt, t0, t1, q):
    vx = calc_vx(y, vy, q)
    ax, ay = 0, -y / y**2
    start = np.array([[0, y], [vx, vy], [ax, ay]])
    steps = round((t1 - t0) / dt) 

    return verlet_integration(start, dt, steps, q)

((x, vx, ax), (y, vy, ay)) = integration(0.1, 0.2, 1E-4, 0, 100, 0.7)
Michael Szczesny
  • The forces computation could be made less redundant by setting `q2=q**2` outside the loop and then computing `fac = -1/(q2*x**2 + y**2); ax,ay = fac*q2*x, fac*y`. – Lutz Lehmann Jun 02 '22 at 18:27
  • Note that it is not the inlining that makes it faster but the fact that operations are done on scalars directly rather than on Numpy arrays. Numba is not able to allocate arrays on the stack yet, so they are allocated on the heap, which is quite expensive for arrays with only 2 items. I hope future implementations will fix this (it is a known long-standing issue of the current implementation). Good optimization! – Jérôme Richard Jun 02 '22 at 18:28