
I'm using np.roll() to do nearest-neighbor-like averaging, but I have a feeling there are faster ways. Here is a simplified example, but imagine 3 dimensions and more complex averaging "stencils". For an example, see section 6 of this paper.

Here are a few lines from that simplified example:

for j in range(nper):
    phi2 = 0.25*(np.roll(phi,  1, axis=0) +
                 np.roll(phi, -1, axis=0) +
                 np.roll(phi,  1, axis=1) +
                 np.roll(phi, -1, axis=1) )
    phi[do_me] = phi2[do_me]

So should I be looking for something that returns views instead of new arrays (since it seems roll returns new arrays)? Is roll allocating a new array each time it's called? I noticed the overhead is huge for small arrays.
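One quick way to check whether a NumPy operation copies is `np.shares_memory()`. This small sketch (my addition, not from the original post) confirms that basic slicing returns views while `np.roll()` allocates a new array on every call:

```python
import numpy as np

a = np.arange(9).reshape(3, 3)

# basic slicing returns a view that shares the original buffer
view = a[1:, :]
print(np.shares_memory(a, view))    # True

# np.roll returns a freshly allocated array each call
rolled = np.roll(a, 1, axis=0)
print(np.shares_memory(a, rolled))  # False
```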

In fact, on my laptop it's most efficient for arrays from about [100, 100] to [300, 300] in size; above that, caching issues may be the culprit.

Would scipy.ndimage.interpolation.shift() perform better, the way it is implemented here, and if so, is it fixed? In the linked example above I'm throwing away the wrapped parts anyway, but I might not always.

note: in this question I'm looking only for what is available within NumPy / SciPy. Of course there are many good ways to speed up Python and even NumPy, but that's not what I'm looking for here, because I'm really trying to understand NumPy better. Thanks!

uhoh
  • For nearest neighbor averaging, have you considered convolution? – Andrew Aug 25 '15 at 04:59
  • [Laplacian filter](http://docs.scipy.org/doc/scipy/reference/generated/scipy.ndimage.filters.laplace.html) – Paul Aug 25 '15 at 05:11
  • `np.roll` constructs an index array and returns `a.take(indexes, axis)`, so that's the slower advanced indexing. `striding` has been suggested as a faster method of taking sliding windows. – hpaulj Aug 25 '15 at 05:44
  • Based on @Andrew's suggestion I looked at `scipy.ndimage.convolve()`, and it's quite flexible on boundaries, number of dimensions, and the kernel configuration. I'll try to do some speed checks. See also the link to a paper I've just added above. – uhoh Aug 25 '15 at 09:00
  • http://stackoverflow.com/q/4936620/901925 may give ideas – hpaulj Aug 25 '15 at 11:53
  • There are some examples of constructing sliding windows using strides [here](http://stackoverflow.com/q/21743871/1461210) and [here](http://stackoverflow.com/q/18424900/1461210). It's not totally clear from your question what you're looking for - it would be helpful if you showed us your current solution. As well as `scipy.ndimage.convolve` you might simply use `scipy.ndimage.uniform_filter` if you just want nearest-neighbour averaging. – ali_m Aug 25 '15 at 12:03
  • @ali_m The code in the linked example uses four `np.roll()`s to implement a stencil (or kernel) that looks like `[[0,1,0], [1,0,1], [0,1,0]]`, and the link to the paper shows an example of a different stencil (or kernel). The two links you gave are really helpful! And the GIF in the first one is great! – uhoh Aug 25 '15 at 12:23
  • I've looked at `scipy.ndimage.filters.convolve()` and the results are "interesting" in that it is sometimes faster and sometimes slower depending on a few things. So I will update the question in a bit... – uhoh Aug 25 '15 at 14:50

2 Answers


np.roll has to create a copy of the array each time, which is why it is (comparatively) slow. Convolution with something like scipy.ndimage.filters.convolve() will be a bit faster, but it may still create copies (depending on the implementation).
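For reference, the four-roll stencil from the question can be written as a single call to `scipy.ndimage.convolve()`. This sketch (my addition) checks that a `[[0,1,0],[1,0,1],[0,1,0]]` kernel with `mode='wrap'` reproduces the roll-based update exactly:

```python
import numpy as np
from scipy import ndimage

phi = np.random.rand(8, 8)

# four np.roll() calls, as in the question
phi_roll = 0.25*(np.roll(phi,  1, axis=0) +
                 np.roll(phi, -1, axis=0) +
                 np.roll(phi,  1, axis=1) +
                 np.roll(phi, -1, axis=1))

# the same stencil as one convolution; mode='wrap' matches roll's periodicity
kernel = 0.25*np.array([[0., 1., 0.],
                        [1., 0., 1.],
                        [0., 1., 0.]])
phi_conv = ndimage.convolve(phi, kernel, mode='wrap')

print(np.allclose(phi_roll, phi_conv))  # True
```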

In this particular case, we can avoid those copies altogether by taking numpy views into a padded version of the original array.

import numpy as np


def no_copy_roll(nx, ny):
    phi_padded = np.zeros((ny+2, nx+2))
    # these are views into different sub-arrays of phi_padded;
    # where sub-arrays overlap, they share memory
    phi_north = phi_padded[:-2, 1:-1]
    phi_east = phi_padded[1:-1, 2:]
    phi_south = phi_padded[2:, 1:-1]
    phi_west = phi_padded[1:-1, :-2]
    phi = phi_padded[1:-1, 1:-1]

    do_me = np.zeros_like(phi, dtype='bool')
    do_me[1:-1, 1:-1] = True

    x0, y0, r0 = 40, 65, 12
    x = np.arange(nx, dtype='float')[None, :]
    y = np.arange(ny, dtype='float')[:, None]
    rsq = (x-x0)**2 + (y-y0)**2
    circle = rsq <= r0**2
    phi[circle] = 1.0
    do_me[circle] = False

    n, nper = 100, 100
    phi_hold = np.zeros((n+1, ny, nx))
    phi_hold[0] = phi
    for i in range(n):
        for j in range(nper):
            phi2 = 0.25*(phi_south +
                         phi_north +
                         phi_east +
                         phi_west)

            phi[do_me] = phi2[do_me]

        phi_hold[i+1] = phi

    return phi_hold

This will cut about 35% off the time for a simple benchmark like:

from original import original_roll
from mwe import no_copy_roll
import numpy as np

nx, ny = (301, 301)
arr1 = original_roll(nx, ny)
arr2 = no_copy_roll(nx, ny)

assert np.allclose(arr1, arr2)

Here is my profiling result (from pyinstrument):

37.685 <module>  timing.py:1
├─ 22.413 original_roll  original.py:4
│  ├─ 15.056 [self]
│  └─ 7.357 roll  <__array_function__ internals>:2
│     └─ 7.243 roll  numpy\core\numeric.py:1110
│           [10 frames hidden]  numpy
├─ 14.709 no_copy_roll  mwe.py:4
└─ 0.393 allclose  <__array_function__ internals>:2
   └─ 0.393 allclose  numpy\core\numeric.py:2091
         [2 frames hidden]  numpy
            0.391 isclose  <__array_function__ internals>:2
            └─ 0.387 isclose  numpy\core\numeric.py:2167
                  [4 frames hidden]  numpy

For more elaborate stencils this approach still works, but it can get a bit unwieldy. In that case you can take a look at skimage.util.view_as_windows, which uses a variation of this trick (numpy stride tricks) to return an array that gives you cheap access to a window around each element. You will, however, have to do your own padding, and will need to take care not to create copies of the resulting array, which can get expensive fast.
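NumPy itself ships a similar helper, `np.lib.stride_tricks.sliding_window_view` (NumPy 1.20+). This sketch is my addition, not part of the original answer; it pads periodically, takes zero-copy 3x3 windows, and averages the four neighbors, matching the roll-based result:

```python
import numpy as np

phi = np.random.rand(6, 6)

# pad periodically so edge elements see wrapped neighbors, as np.roll would
phi_padded = np.pad(phi, 1, mode='wrap')

# windows[i, j] is a cheap 3x3 view centered on phi[i, j] (no data copied)
windows = np.lib.stride_tricks.sliding_window_view(phi_padded, (3, 3))

# average the four edge-midpoints of each window: N, S, W, E neighbors
phi2 = 0.25*(windows[:, :, 0, 1] + windows[:, :, 2, 1] +
             windows[:, :, 1, 0] + windows[:, :, 1, 2])

reference = 0.25*(np.roll(phi,  1, axis=0) + np.roll(phi, -1, axis=0) +
                  np.roll(phi,  1, axis=1) + np.roll(phi, -1, axis=1))
print(np.allclose(phi2, reference))  # True
```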

FirefoxMetzger
  • `+1` Wow, thank you for taking the time to write such a thorough and thoughtful answer! I've been avoiding learning about skimage for years; maybe this is the excuse I need to dig in and find out what's there :-) – uhoh Aug 04 '20 at 00:18
  • btw, while I've used Python almost every day between the day I posted this question almost five years ago and today, I'm still an amateur. The line `Program: timing.py` confuses me; is that something that runs as-is? – uhoh Aug 04 '20 at 00:23
  • @uhoh `timing.py` is the name I gave to the above benchmark script comparing the original version and the zero-copy version of the code. Sorry if that confused you. It's part of the output of `pyinstrument`, a pretty good Python profiler. – FirefoxMetzger Aug 04 '20 at 05:16
  • Thanks, I appreciate your patience and will try all of this out soon! – uhoh Aug 04 '20 at 08:07

The fastest implementation I could get so far is based on the low-level implementation of scipy.ndimage.interpolation.shift that you already mentioned:

import numpy as np
from scipy.ndimage.interpolation import _nd_image, _ni_support

data = np.random.rand(300, 300)  # example array to roll, updated in place
shift = [1, 0]                   # per-axis shift, analogous to np.roll's shift

cval = 0.0  # unused for mode `wrap`
mode = _ni_support._extend_mode_to_code('wrap')
_nd_image.zoom_shift(data, None, shift, data, 0, mode, cval)  # in-place update

Precomputing mode, cval, and shift, and calling the low-level zoom_shift method directly, got me about a 5x speedup relative to calling shift, and about 10x relative to np.roll.
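If you'd rather stay on the public API, `scipy.ndimage.shift` with `order=0` and `mode='grid-wrap'` (available since SciPy 1.6) reproduces `np.roll` exactly for integer shifts. This sketch is my addition, not part of the original answer:

```python
import numpy as np
from scipy import ndimage

a = np.arange(12, dtype=float).reshape(3, 4)

# order=0 (no interpolation) + periodic boundary == an integer roll
shifted = ndimage.shift(a, (1, 0), order=0, mode='grid-wrap')
rolled = np.roll(a, 1, axis=0)
print(np.array_equal(shifted, rolled))  # True
```

Note that, unlike the low-level call above, this allocates an output array, so it mainly buys convenience rather than the in-place speedup.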

milembar
  • This problem comes up for me at regular intervals, I'll try this next time and let you know how it goes. *Thanks!* – uhoh Apr 19 '21 at 13:37