4

Firstly, I'd like to apologize for the badly worded title - I can't currently think of a better way to phrase it. I'm wondering if there's a faster way to implement an array operation in Python where each step depends on previous outputs in an iterative fashion (e.g. forward differencing, filtering, etc.) - that is, operations of the form:

for n in range(1, len(X)):
    Y[n] = X[n] + X[n - 1] + Y[n-1]

Here X is an array of values and Y is the output; Y[0] is assumed to be known or calculated separately before the above loop. My question is: is there NumPy functionality to speed up this sort of self-referential loop? It is the major bottleneck in almost all of my scripts. I know NumPy routines benefit from being implemented in C, so I was curious if anyone knew of any NumPy routines that would help here. Failing that, are there better ways to program this loop (in Python) that would speed up its execution for large arrays (>500,000 data points)?

Yoshi

3 Answers

12

Accessing single NumPy array elements or (elementwise-)iterating over a NumPy array is slow (like really slow). If you ever want to do a manual iteration over a NumPy array: Just don't do it!

But you have some options. The easiest is to convert the array to a Python list and iterate over the list (sounds silly, but stay with me - I'll present some benchmarks at the end of the answer [1]):

X = X.tolist()
Y = Y.tolist()
for n in range(1, len(X)):
    Y[n] = X[n] + X[n - 1] + Y[n-1]
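
If you need a NumPy array again after the loop (the question doesn't say, so this is only an assumption), convert back once at the end instead of element by element:

import numpy as np

Y = np.array(Y)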

If you also use direct iteration over the lists, it could be even faster:

X = X.tolist()
Y = Y.tolist()
for idx, (Y_n_m1, X_n, X_n_m1) in enumerate(zip(Y, X[1:], X), 1):
    Y[idx] = X_n + X_n_m1 + Y_n_m1

Then there are more sophisticated options that require additional packages, most notably Cython and Numba. These are designed to work on the array elements directly and avoid Python overhead wherever possible. For example, with Numba you can just use your approach inside a jitted (just-in-time compiled) function:

import numba as nb

@nb.njit
def func(X, Y):
    for n in range(1, len(X)):
        Y[n] = X[n] + X[n - 1] + Y[n-1]

Here X and Y can be NumPy arrays, but numba will work on the underlying buffer directly, out-speeding the other approaches (possibly by orders of magnitude).
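
For instance, calling the jitted function on NumPy arrays could look like this (the array size and initial value are just placeholders for illustration):

import numpy as np

X = np.random.random(500000)
Y = np.zeros_like(X)
Y[0] = np.random.random()  # Y[0] known/calculated beforehand, as in the question
func(X, Y)                 # the first call compiles, later calls run at native speed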

Numba is a "heavier" dependency than Cython, but it can be faster and easier to use. Without conda, though, it's hard to install numba... YMMV

However, here's also a Cython version of the code (compiled using IPython's cell magic; it's a bit different if you're not using IPython - see the sketch after the cell):

In [1]: %load_ext cython

In [2]: %%cython
   ...:
   ...: cimport cython
   ...:
   ...: @cython.boundscheck(False)
   ...: @cython.wraparound(False)
   ...: cpdef cython_indexing(double[:] X, double[:] Y):
   ...:     cdef Py_ssize_t n
   ...:     for n in range(1, len(X)):
   ...:         Y[n] = X[n] + X[n - 1] + Y[n-1]
   ...:     return Y
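
Outside IPython, one minimal way to build the same function is a small setup.py using cythonize (the file name here is an assumption for the sketch):

# setup.py - assuming the cpdef function above is saved as "recurrence.pyx"
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("recurrence.pyx"))

After running python setup.py build_ext --inplace you can import cython_indexing from the resulting recurrence module like any other Python function.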

Just to give an example regarding the timings (based on the timing framework from my answer to another question):

import numpy as np
import numba as nb
import scipy.signal

def numpy_indexing(X, Y):
    for n in range(1, len(X)):
        Y[n] = X[n] + X[n - 1] + Y[n-1]
    return Y

def list_indexing(X, Y):
    X = X.tolist()
    Y = Y.tolist()
    for n in range(1, len(X)):
        Y[n] = X[n] + X[n - 1] + Y[n-1]
    return Y

def list_direct(X, Y):
    X = X.tolist()
    Y = Y.tolist()
    for idx, (Y_n_m1, X_n, X_n_m1) in enumerate(zip(Y, X[1:], X), 1):
        Y[idx] = X_n + X_n_m1 + Y_n_m1
    return Y

@nb.njit
def numba_indexing(X, Y):
    for n in range(1, len(X)):
        Y[n] = X[n] + X[n - 1] + Y[n-1]
    return Y


def numpy_cumsum(X, Y):
    # unrolling the recurrence gives Y[n] = Y[0] + sum_{k=1..n} (X[k] + X[k-1]),
    # i.e. a cumulative sum of the pairwise sums
    Y[1:] = X[1:] + X[:-1]
    np.cumsum(Y, out=Y)
    return Y

def scipy_lfilter(X, Y):
    # the recurrence Y[n] - Y[n-1] = X[n] + X[n-1] is a linear filter with
    # a = [1, -1] and b = [1, 1]; the Y[0] - X[0] term corrects for the
    # initial condition (lfilter assumes a zero initial state)
    a = [1, -1]
    b = [1, 1]
    return Y[0] - X[0] + scipy.signal.lfilter(b, a, X)

# Make sure the approaches give the same result
X = np.random.random(10000)
Y = np.zeros(10000)
Y[0] = np.random.random()

np.testing.assert_array_equal(numba_indexing(X, Y), numpy_indexing(X, Y))
np.testing.assert_array_equal(numba_indexing(X, Y), numpy_cumsum(X, Y))
np.testing.assert_almost_equal(numba_indexing(X, Y), scipy_lfilter(X, Y))
np.testing.assert_array_equal(numba_indexing(X, Y), cython_indexing(X, Y))

# Timing setup
timings = {numpy_indexing: [], 
           list_indexing: [], 
           list_direct: [],
           numba_indexing: [],
           numpy_cumsum: [],
           scipy_lfilter: [],
           cython_indexing: []}
sizes = [2**i for i in range(1, 20, 2)]

# Timing
for size in sizes:
    X = np.random.random(size=size)
    Y = np.zeros(size)
    Y[0] = np.random.random()
    for func in timings:
        res = %timeit -o func(X, Y)
        timings[func].append(res)

# Plotting absolute times

%matplotlib notebook
import matplotlib.pyplot as plt

fig = plt.figure(1)
ax = plt.subplot(111)

for func in timings:
    ax.plot(sizes, 
            [time.best for time in timings[func]], 
            label=str(func.__name__))
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('size')
ax.set_ylabel('time [seconds]')
ax.grid(which='both')
ax.legend()
plt.tight_layout()

# Plotting relative times

fig = plt.figure(2)
ax = plt.subplot(111)

baseline = numba_indexing # choose one function as baseline
for func in timings:
    ax.plot(sizes, 
            [time.best / ref.best for time, ref in zip(timings[func], timings[baseline])], 
            label=str(func.__name__))
ax.set_yscale('log')
ax.set_xscale('log')
ax.set_xlabel('size')
ax.set_ylabel('time relative to "{}"'.format(baseline.__name__))
ax.grid(which='both')
ax.legend()

plt.tight_layout()

With the following results:

Absolute runtimes

[log-log plot: runtime in seconds vs. array size for each approach]

Relative runtimes (compared to the numba function)

[log-log plot: runtime relative to numba_indexing vs. array size]

So, just by converting the arrays to lists you will be roughly 3 times faster! By iterating directly over those lists you get another (smaller) speedup - only 20% in this benchmark, but now we're almost 4 times faster than the original solution. With numba you can speed it up by a factor of more than 100 compared to the list operations! And Cython is only a bit slower than numba (~40-50%), probably because I haven't squeezed out every possible optimization you could do with Cython (usually it's not more than 10-20% slower). However, for large arrays the difference gets smaller.

[1] I went into more detail in another answer. That Q+A was about converting to a set, but because set uses (hidden) "manual iteration", it also applies here.

I included the timings for the NumPy cumsum and SciPy lfilter approaches. These were roughly 20 times slower for small arrays and 4 times slower for large arrays compared to the numba function. However, if I interpret the question correctly, you were looking for general ways, not only ones that apply to this particular example. Not every self-referencing loop can be implemented using cum* functions from NumPy or SciPy's filters (see the sketch below). But even where they can, it seems like they can't compete with Cython and/or numba.
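
As a (made-up) illustration of such a loop: a nonlinear recurrence like the one below has no cumsum/lfilter equivalent, but the numba (or Cython) approach handles it unchanged:

import numba as nb

@nb.njit
def nonlinear_recurrence(X, Y):
    # Y[n] depends on the square of Y[n-1], so this is not a linear filter
    for n in range(1, len(X)):
        Y[n] = Y[n - 1] * Y[n - 1] + X[n]
    return Y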

MSeifert
  • I think I got you beat - can you add timings for your machine with my answer? – o11c Sep 11 '17 at 03:46
  • I'm going to be modifying my code soon to try each of these solutions, but it was my impression that using lists was slower for super-large arrays? I'll routinely be using data structures with millions of points. What's the performance tradeoff for Python lists once we get to that degree? – Yoshi Sep 11 '17 at 03:53
  • @o11c I included the timing. But I tried to keep it more general. `np.cum*` functions may not be applicable in all cases. It works here, but a lot of these self-referencing loops can't be rewritten using numpy functions. – MSeifert Sep 11 '17 at 03:53
  • @Yoshi If you do manual iterations then converting to a list and iterating over the list will _always_ be faster. But lists need much more memory (in most cases 3-10 times more), so if memory & performance are both a concern you probably need to leverage Cython or Numba to make it work fast **and** memory-efficient. – MSeifert Sep 11 '17 at 03:56
  • Having a play around with my code now, but even just using @nb.njit on my most computationally heavy function gave me a 20x speedup in my script overall. Woah! Very appreciative, thanks. – Yoshi Sep 11 '17 at 04:50
  • This is a really great answer by now. I have often wished there were a nice, easy library for timing studies like this (comparing different algorithms across different sizes) which handles everything cleanly. – chthonicdaemon Sep 11 '17 at 14:46
  • Having come back and read the most updated version of your answer, I really appreciate the depth of it! In fact, Numba has become my new favourite module, and I'm currently working on understanding how best to use it. Since I create computational models for physics research, this sort of easy-to-use performance module is the best thing I've come across, short of writing code in FORTRAN or C++ - especially since I'm a physicist and not a computer engineer! – Yoshi Sep 14 '17 at 14:37
7

It's pretty simple using np.cumsum: unrolling the recurrence gives Y[n] = Y[0] + (X[1] + X[0]) + (X[2] + X[1]) + ... + (X[n] + X[n-1]), which is just a cumulative sum of the pairwise sums:

#!/usr/bin/env python3
import numpy as np
import random

def r():
    return random.randint(100, 1000)
X = np.array([r() for _ in range(10)])
fast_Y = np.ndarray(X.shape, dtype=X.dtype)
slow_Y = np.ndarray(X.shape, dtype=X.dtype)
slow_Y[0] = fast_Y[0] = r()

# fast method
fast_Y[1:] = X[1:] + X[:-1]
np.cumsum(fast_Y, out=fast_Y)

# original method
for n in range(1, len(X)):
    slow_Y[n] = X[n] + X[n - 1] + slow_Y[n-1]


assert (fast_Y == slow_Y).all()
o11c
    You might want to use `X = np.random.randint(100, 1000, X.shape)` instead of calling `random.randint` repeatedly. It will be much faster (and shorter) :) Also using `ndarray` directly is kind of a code-smell. You could just use `np.zeros` or `np.copy`. – MSeifert Sep 11 '17 at 11:42
  • @MSeifert It's not code smell at all, this is *exactly* the use case for the `ndarray` constructor. – o11c Sep 12 '17 at 06:09
  • Even the [`numpy.ndarray` documentation](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.html#numpy.ndarray) mentions that you shouldn't use it: "Arrays should be constructed using `array`, `zeros` or `empty` (refer to the See Also section below). The parameters given here refer to a low-level method (ndarray(...)) for instantiating an array." – MSeifert Sep 12 '17 at 06:11
6

The situation you describe is basically a discrete filter operation, which is implemented in scipy.signal.lfilter. Rearranging your recurrence as Y[n] - Y[n-1] = X[n] + X[n-1] and matching it against lfilter's difference equation a[0]*y[n] + a[1]*y[n-1] = b[0]*x[n] + b[1]*x[n-1] gives a = [1, -1] and b = [1, 1]. (A more general coefficient-mapping sketch follows the timings below.)

import numpy as np
import scipy.signal

a = [1, -1]
b = [1, 1]

X = np.random.random(10000)
Y = np.zeros(10000)

newY = scipy.signal.lfilter(b, a, X) + (Y[0] - X[0])

On my computer, the timings work out as follows:

%timeit func4(X, Y.copy())
# 100000 loops, best of 3: 14.6 µs per loop

%timeit newY = scipy.signal.lfilter(b, a, X) + (Y[0] - X[0])
# 10000 loops, best of 3: 68.1 µs per loop
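
More generally, other linear recurrences can be mapped onto lfilter by reading the coefficients off their difference equation. A small sketch with a made-up recurrence y[n] = 0.5*y[n-1] + x[n], so a = [1, -0.5] and b = [1] (note that lfilter assumes a zero initial state; use its zi argument for other starting values):

import numpy as np
import scipy.signal

x = np.random.random(1000)

# y[n] = 0.5 * y[n-1] + x[n]  ->  a = [1, -0.5], b = [1]
y = scipy.signal.lfilter([1.0], [1.0, -0.5], x)

# the same thing as an explicit loop (with y[-1] treated as 0)
y_loop = np.zeros_like(x)
y_loop[0] = x[0]
for n in range(1, len(x)):
    y_loop[n] = 0.5 * y_loop[n - 1] + x[n]

np.testing.assert_allclose(y, y_loop)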
chthonicdaemon
  • Ah, why is my computer so slow.... It seems like one could easily get a 4 times improvement by not using my computer. Note that in order to do what the question asked for you need to use `scipy.signal.lfilter(b, a, X) - X[0] + Y[0]`. It just happened to be `0` in my tests. :) – MSeifert Sep 11 '17 at 11:16
  • @MSeifert Thanks for the fix on the initial values. I'm too used to zero-mean signals. – chthonicdaemon Sep 11 '17 at 14:45
  • You're right, in the context of this question I was asking about ways to improve a filter (4th order Butterworth applied in the time domain), however I wanted to phrase it in such a way that the response could work for *any* recursive matrix operation - for example, I use a great deal of finite differencing in my models. – Yoshi Sep 14 '17 at 14:40
  • @Yoshi finite differences can also be constructed as filters. You may be able to save some time if you construct the convolution of the filters you want to implement first, then do the filtering on the signal (for instance, if you want to run the Butterworth on the derivative of the signal, you can convolve those and run the resulting filter once). – chthonicdaemon Sep 14 '17 at 15:07