
I have two pieces of code that do almost the same thing.

code 1:

from __future__ import division
import numpy as np

m = 1
gamma = 1
lam = 1
alpha = 1
step_num = 2 ** 16
dt = 0.02


def E_and_x(x0):
    xi = x0
    vi = 0
    f = 0
    xsum = 0
    Ei, xavg = 0, 0
    for i in range(step_num):
        vi += f / m * dt / 2
        xi += vi * dt
        f = - gamma * xi - lam * xi ** 2 - alpha * xi ** 3
        vi += f / m * dt / 2
        Ei = 1 / 2 * m * vi ** 2 + 1 / 2 * gamma * xi ** 2 + \
            1 / 3 * lam * xi ** 3 + 1 / 4 * alpha * xi ** 4
        xsum += xi
        xavg = xsum / (i + 1)
    return Ei, xavg

E, x = [], []
for x0 in np.linspace(0, 1, 40):
    mdresult = E_and_x(x0)
    E.append(mdresult[0])
    x.append(mdresult[1])

code 2:

from __future__ import division
import numpy as np
from numba import jit

time = 50
niter = 2 ** 16  # number of iterations
t = np.linspace(0, time, num=niter, endpoint=True)


class MolecularDynamics(object):
    def __init__(self, time, niter, initial_pos):
        self.position = np.array([])
        self.velocity = np.array([])
        self.energy = np.array([])
        self.x_average = np.array([])
        self.vel = 0  # intermediate variable
        self.force = 0  # intermediate variable
        self.e = 0  # intermediate energy
        self.initial_pos = initial_pos  # initial position
        self.pos = self.initial_pos
        self.time = time
        self.niter = niter
        self.time_step = self.time / self.niter
        self.mass = 1
        self.k = 1  # stiffness coefficient
        self.lamb = 1  # lambda
        self.alpha = 1  # quartic coefficient

    @jit
    def iter(self):
        for i in np.arange(niter):
            # step 1 of leap frog
            self.vel += self.time_step / 2.0 * self.force / self.mass
            self.pos += self.time_step * self.vel
            # step 2 of leap frog
            self.force = - self.k * self.pos - self.lamb * self.pos ** 2 - self.alpha * self.pos ** 3
            self.vel += self.time_step / 2.0 * self.force / self.mass

            # calculate energy
            self.e = 1 / 2 * self.mass * self.vel ** 2 + \
                     1 / 2 * self.k * self.pos ** 2 + \
                     1 / 3 * self.lamb * self.pos ** 3 + \
                     1 / 4 * self.alpha * self.pos ** 4

            self.velocity = np.append(self.velocity, [self.vel])  # record vel after 1 time step
            self.position = np.append(self.position, self.pos)  # record pos after 1 time step
            self.energy = np.append(self.energy, [self.e])  # record e after 1 time step
            self.x_average = np.append(self.x_average, np.sum(self.position) / (i + 1))


mds = [MolecularDynamics(time, niter, xx) for xx in np.linspace(0, 1, num=40)]
[md.iter() for md in mds]  # loop to change value
mds_x_avg = [md.x_average[-1] for md in mds]
mds_e = [md.e for md in mds]

Well, the main difference is that code 2 uses OO, NumPy and JIT. However, code 2 is much slower than code 1 (it takes many minutes to compute).

In [1]: %timeit code_1
10000000 loops, best of 3: 25.7 ns per loop

By profiling I know the bottleneck is the iter() function, and more specifically the calls to append and sum. But using NumPy is as far as I can go; I wonder why code 2 is so much slower and how I can speed it up.

– Jadim
    Whenever you use `np.append` you will throw away the famous numpy-speed. – MSeifert Jan 28 '17 at 17:22
  • Unless you can turn your problem into calculations between vectors/matrices/tensors, numpy won't magically make your code faster. You may try pypy or cython to JIT/AOT compile the code though. – kennytm Jan 28 '17 at 17:23
  • @kennytm Sorry, but code 2 is already JIT-compiled. – Jadim Jan 28 '17 at 17:28
  • For simple element access and removal, numpy arrays are almost 10 times slower than native python lists. Keeping this in mind, you just have to make a choice. – hashcode55 Jan 28 '17 at 17:35
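MSeifert's first comment can be illustrated with a quick sketch (not from the thread): three ways of growing an array, all producing the same result, where only the `np.append` version pays an allocate-and-copy cost on every iteration.

```python
import numpy as np

def grow_with_append(n):
    # np.append allocates a brand-new array and copies every element
    # on each call, so this loop is O(n^2) overall.
    arr = np.array([])
    for i in range(n):
        arr = np.append(arr, i)
    return arr

def grow_with_list(n):
    # list.append is amortized O(1); convert to an array once at the end.
    out = []
    for i in range(n):
        out.append(i)
    return np.array(out, dtype=float)

def grow_preallocated(n):
    # Allocate once, fill by index: no reallocation or copying at all.
    arr = np.zeros(n)
    for i in range(n):
        arr[i] = i
    return arr

# All three produce identical arrays; only the cost differs.
a, b, c = grow_with_append(1000), grow_with_list(1000), grow_preallocated(1000)
assert np.array_equal(a, b) and np.array_equal(b, c)
```

Timing the three with `%timeit` for growing `n` makes the quadratic behaviour of the first version obvious.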

2 Answers


You did something wrong with your timings, just testing your first code (slightly modified):

from __future__ import division


def E_and_x(x0):
    m = 1
    gamma = 1
    lam = 1
    alpha = 1
    step_num = 2 ** 13   # much less iterations!
    dt = 0.02

    xi = x0
    vi = 0
    f = 0
    xsum = 0
    Ei, xavg = 0, 0
    for i in range(step_num):
        vi += f / m * dt / 2
        xi += vi * dt
        f = - gamma * xi - lam * xi ** 2 - alpha * xi ** 3
        vi += f / m * dt / 2
        Ei = 1 / 2 * m * vi ** 2 + 1 / 2 * gamma * xi ** 2 + \
            1 / 3 * lam * xi ** 3 + 1 / 4 * alpha * xi ** 4
        xsum += xi
        xavg = xsum / (i + 1)
    return Ei, xavg

The timings are not in the nanosecond regime:

%timeit [E_and_x(x0) for x0 in np.linspace(0, 1, 40)]  # 1 loop, best of 3: 3.46 s per loop

However, if it is an option, I would definitely advise JIT-compiling the E_and_x function:

import numba as nb

numbaE_and_x = nb.njit(E_and_x)

numbaE_and_x(1.2)  # warmup for the jit
%timeit [numbaE_and_x(x0) for x0 in np.linspace(0, 1, 40)]  # 100 loops, best of 3: 3.38 ms per loop

That's already roughly 1000 times faster. If you run the first code with PyPy (or Cythonize it) you should get similar results.

Apart from that:

  • np.append is a terrible choice here, because np.append, np.concatenate and np.stack (all variants) must allocate a new array and copy every input array into it. And you don't do anything with these arrays except append to them, so the only thing you use NumPy for is something NumPy is really bad at!
  • Check if you can vectorize any of the calculations using numpy-arrays (without append, etc.).
  • And if it's still too slow, all those self.xxx attribute accesses are slow; it's better to read them into local variables once before the loop and write them back afterwards.
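As an illustration of that last point, here is a sketch (a minimal stand-in class, not the OP's exact code) of hoisting the attribute reads into locals before the hot loop and writing the state back once afterwards:

```python
class MD:
    """Minimal stand-in for the OP's class, just enough state for the loop."""
    def __init__(self, x0, niter=2 ** 13, dt=0.02):
        self.pos, self.vel, self.force = x0, 0.0, 0.0
        self.time_step, self.niter = dt, niter
        self.mass = self.k = self.lamb = self.alpha = 1

    def iter(self):
        # Read attributes into locals once; each self.xxx inside the loop
        # would otherwise cost a dictionary lookup per access.
        vel, pos, force = self.vel, self.pos, self.force
        dt, m = self.time_step, self.mass
        k, lamb, alpha = self.k, self.lamb, self.alpha
        for _ in range(self.niter):
            vel += dt / 2.0 * force / m
            pos += dt * vel
            force = -k * pos - lamb * pos ** 2 - alpha * pos ** 3
            vel += dt / 2.0 * force / m
        # Write the final state back once, after the loop.
        self.vel, self.pos, self.force = vel, pos, force

md = MD(0.5)
md.iter()
```

The loop body is byte-for-byte the same physics; only the attribute lookups move out of the loop.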
– MSeifert

In addition to what MSeifert said, you can preallocate the arrays to the correct size instead of appending to them. So instead of creating them like this:

self.position = np.array([])  # No

you would write:

self.position = np.zeros(niter) # Yes

Then instead of appending like this:

self.velocity = np.append(self.velocity, [self.vel])

you would fill in like this:

self.velocity[i] = self.vel

This avoids reallocating the arrays at each iteration (you can do exactly the same with raw Python lists, by the way, using array = [someValue] * size).
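Putting the two changes together, a sketch of how the OP's class might look with preallocated arrays (the running sum for the average is borrowed from code 1's xsum, to avoid re-summing the prefix with np.sum on every step; Python 3 syntax, add the `__future__` import on Python 2):

```python
import numpy as np

class MolecularDynamics(object):
    def __init__(self, time, niter, initial_pos):
        # Allocate every record array once, at full size.
        self.position = np.zeros(niter)
        self.velocity = np.zeros(niter)
        self.energy = np.zeros(niter)
        self.x_average = np.zeros(niter)
        self.vel = 0.0
        self.force = 0.0
        self.pos = initial_pos
        self.niter = niter
        self.time_step = time / niter
        self.mass = self.k = self.lamb = self.alpha = 1

    def iter(self):
        x_sum = 0.0  # running sum replaces np.sum over the prefix each step
        for i in range(self.niter):
            self.vel += self.time_step / 2.0 * self.force / self.mass
            self.pos += self.time_step * self.vel
            self.force = (-self.k * self.pos - self.lamb * self.pos ** 2
                          - self.alpha * self.pos ** 3)
            self.vel += self.time_step / 2.0 * self.force / self.mass
            e = (1 / 2 * self.mass * self.vel ** 2
                 + 1 / 2 * self.k * self.pos ** 2
                 + 1 / 3 * self.lamb * self.pos ** 3
                 + 1 / 4 * self.alpha * self.pos ** 4)
            # Fill by index instead of reallocating with np.append.
            self.velocity[i] = self.vel
            self.position[i] = self.pos
            self.energy[i] = e
            x_sum += self.pos
            self.x_average[i] = x_sum / (i + 1)
```

Note that `x_average[-1]` now equals `position.mean()`, exactly as in the original, but each step does constant work.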

Vectorizability

I went on wondering about the vectorizability of the OP's algorithm. It seems that it isn't vectorizable. To quote the Cornell Virtual Workshop:

Read after write ("flow") dependency. This kind of dependency is not vectorizable. It occurs when the values of variables in a particular loop iteration (the "read") are determined by a previous loop iteration (the "write").

The "flow" dependency is seen in the loop, where several members' values are determined by the previous state of that same member. For example:

self.vel += self.time_step / 2.0 * self.force / self.mass

Here, self.force is from the previous iteration and was computed from the previous self.vel.
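The time stepping itself carries that flow dependency, but the running average is pure post-processing: once the position array has been filled, x_average can be computed in one vectorized shot. A sketch (assuming `position` holds the recorded trajectory):

```python
import numpy as np

def running_average(position):
    # x_average[i] = (position[0] + ... + position[i]) / (i + 1),
    # computed for the whole trajectory at once via a cumulative sum.
    n = len(position)
    return np.cumsum(position) / np.arange(1, n + 1)

pos = np.array([1.0, 3.0, 5.0])
running_average(pos)  # array([1., 2., 3.])
```

So even when the main loop resists vectorization, hoisting the averaging out of it removes one of the per-iteration costs entirely.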

– daragua
  • But one still has to pay the "unboxing" cost each iteration when you set the item! It might perform better for `position` because `np.sum(position)` is faster than `sum` but for all other "arrays" it's just unnecessary overhead. – MSeifert Jan 29 '17 at 06:00
  • Yes, but I don't see how to avoid the unboxing overhead unless the algorithm can be vectorized. I haven't found a way to do so, nor found info about general proof-of-vectorizability. This answer just avoids allocation/copy/deletion of the array on each iteration, which is just one step in the optimization process. – daragua Jan 29 '17 at 12:49
  • Yes and I like your answer. It was only meant as comment (for future readers), not as criticism. :) – MSeifert Jan 29 '17 at 16:14
  • No problem, I didn't take it as such :) I was really wondering if there were ways to avoid the unboxing in numpy. – daragua Jan 29 '17 at 17:34
  • Yes, after I changed `self.position = np.array([])` to `self.position = np.zeros(niter)`, and `self.velocity = np.append(self.velocity, [self.vel])` to `self.velocity[i] = self.vel`, there was noticeable speed-up. – Jadim Jan 29 '17 at 21:23
  • And I think the bottleneck is this statement: `self.x_average = np.append(self.x_average, np.sum(self.position) / (i + 1))`. After I added a statement `self.x_sum = 0` in `__init__` method, and 2 statements: `self.x_sum += self.pos` as well as `self.x_average[i] = self.x_sum / (i + 1)` in `iter()` method, there was another boost in speed. I should have thought about this! – Jadim Jan 29 '17 at 21:26
  • Seems like a reasonable thing to do. While we're optimizing, we can do better memory-wise: `np.arange` will create a full table of 2**16 integers. Which might not be dramatic. But still, here it is not needed. For Python 2.7 you can use [xrange](http://stackoverflow.com/questions/94935/what-is-the-difference-between-range-and-xrange-functions-in-python-2-x), for Python 3, good old [range](https://docs.python.org/3/library/functions.html?highlight=range#func-range). – daragua Jan 29 '17 at 23:19