8

I am performing a convolution along the Z axis of a 3D numpy array, followed by other operations on the results, but it is slow as currently implemented. Is the for loop what is slowing me down here, or is it the convolution? I tried reshaping to a 1D vector and performing the convolution in one pass (as I did in Matlab), without the for loop, but it doesn't improve performance. My Matlab version is about 50% faster than anything I can come up with in Python. Relevant section of code:

convolved = np.zeros((y_lines, x_lines, z_depth))
for i in range(0, y_lines):
    for j in range(0, x_lines):
        convolved[i, j, :] = fftconvolve(data[i, j, :], Gauss)  # 80% of time here
        result[i, j, :] = other_calculations(convolved[i, j, :])  # 20% of time here

Is there a better way to do this than a for loop? I have heard of Cython, but I have limited experience in Python as of now and would aim for the simplest solution.

user4547612
  • What is `Gauss`? Some kind of 1-D gaussian kernel? If so, what size relative to `z_depth`? – Curt F. Aug 15 '15 at 21:11
  • Gaussian kernel generated once, before the loop. The data is a 1D vector (z_depth), typically around 1535 elements long, with a 1D Gaussian kernel of length typically 79. I cleaned up a bunch of the overhead in fftconvolve, so it basically goes directly to irfftn(rfftn(in1, fshape) * rfftn(in2, fshape), fshape)[fslice].copy() – user4547612 Aug 15 '15 at 21:27

2 Answers

6

The fftconvolve function you are using is presumably from SciPy. If so, be aware that it takes N-dimensional arrays. So a faster way to do your convolution would be to generate the 3d kernel that corresponds to doing nothing in the x and y dimensions and doing a 1d gaussian convolution in z.

Some code and timing results are below. On my machine and with some toy data, this led to a 10× speedup, as you can see:

import numpy as np
from scipy.signal import fftconvolve
from scipy.ndimage.filters import gaussian_filter

# use scipy's gaussian_filter to generate a normalized 1d gaussian kernel
kernel_base = np.ones(shape=(5,))
kernel_1d = gaussian_filter(kernel_base, sigma=1, mode='constant')
kernel_1d = kernel_1d / np.sum(kernel_1d)

# make the 3d kernel that does gaussian convolution along the z axis only
kernel_3d = np.zeros(shape=(1, 1, 5))
kernel_3d[0, 0, :] = kernel_1d

# generate random data
data = np.random.random(size=(50, 50, 50))

# define a function for loop-based convolution for easy timeit invocation
def convolve_with_loops(data):
    nx, ny, nz = data.shape
    convolved = np.zeros((nx, ny, nz))
    for i in range(0, nx):
        for j in range(0, ny):
            convolved[i, j, :] = fftconvolve(data[i, j, :], kernel_1d, mode='same')
    return convolved

# compute the convolution two diff. ways: with loops (first) or as a 3d convolution (2nd)
convolved = convolve_with_loops(data)
convolved_2 = fftconvolve(data, kernel_3d, mode='same')

# raise an error unless the two computations return equivalent results
assert np.all(np.isclose(convolved, convolved_2))

# time the two routes of the computation
%timeit convolved = convolve_with_loops(data)
%timeit convolved_2 = fftconvolve(data, kernel_3d, mode='same')

timeit results:

10 loops, best of 3: 198 ms per loop
100 loops, best of 3: 18.1 ms per loop
Curt F.
  • Try generating data of length 64 and see if this makes it faster. FFT is usually significantly more efficient on powers of two. – cxrodgers Aug 15 '15 at 22:32
  • I implemented a 3D version, and it is slower than what I had: 5851.7 ms (new 3D version) vs. 4093.4 ms (old version). – user4547612 Aug 15 '15 at 22:47
  • @cxrodgers: fftconvolve uses _next_regular(target) to find the optimal FFT size (here 1620 for a 1535-element vector, padded with zeros). – user4547612 Aug 15 '15 at 22:55
  • What is the exact shape of everyone's data? The 3-D `fftconvolve` stays faster for me up to a 256 x 256 x 256-sized array, although by slightly less than 2x instead of the 10x for the code I posted above. – Curt F. Aug 15 '15 at 23:12
  • Data is (minimum size) 200x200x1535, the Gaussian is 1x79. As I mentioned, I removed all the overhead from fftconvolve (if statements etc.) so it goes directly to the convolution; that shaved about 30% off the unmodified fftconvolve. – user4547612 Aug 15 '15 at 23:29
  • Interesting, I tried 200x200x1535 convolution with a 3d gaussian kernel of only 1x5 shape, and my solution got about twofold worse than the for loops. I guess the implication is that the optimal solution would be to use `for` loops to go over chunks of the data, not just one pixel at a time, and in each chunk use my "3d" method. – Curt F. Aug 16 '15 at 00:01
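
For illustration, here is a minimal sketch of the chunked approach suggested in the last comment. The function name and the chunk size of 32 are arbitrary choices for the example, not values from the discussion:

import numpy as np
from scipy.signal import fftconvolve

def convolve_in_chunks(data, kernel_1d, chunk=32):
    # reshape the 1d kernel so fftconvolve acts only along the z axis
    kernel_3d = kernel_1d.reshape(1, 1, -1)
    out = np.empty_like(data)
    # process `chunk` rows of the y axis at a time instead of one pixel at a time
    for i in range(0, data.shape[0], chunk):
        out[i:i + chunk] = fftconvolve(data[i:i + chunk], kernel_3d, mode='same')
    return out

The best chunk size would have to be found by timing on the real data.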
2

I think you have already found the source code of the fftconvolve function. Normally, for real inputs, it uses the numpy.fft.rfftn and .irfftn functions, which compute N-dimensional transforms. Since the goal is to do multiple 1-D transformations, you can basically rewrite fftconvolve like this (simplified):

from scipy.signal.signaltools import _next_regular

def fftconvolve_1d(in1, in2):

    outlen = in1.shape[-1] + in2.shape[-1] - 1
    n = _next_regular(outlen)

    tr1 = np.fft.rfft(in1, n)
    tr2 = np.fft.rfft(in2, n)
    out = np.fft.irfft(tr1 * tr2, n)

    return out[..., :outlen].copy()

And calculate the desired result:

result = fftconvolve_1d(data, Gauss)

This works because numpy.fft.rfft and .irfft (notice the lack of n in the name) transform over a single axis of the input array (the last axis by default). This is roughly 40% faster than the OP code on my system.
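Note that, like the call in the question, fftconvolve_1d returns the full convolution (length in1 + in2 - 1 along the last axis). If the central, 'same'-sized part is what you need, it can be sliced out afterwards. A hedged usage sketch, using the shapes mentioned in the comments and a stand-in kernel:

# toy inputs with the shapes from the comments (200x200x1535 data, 79-point kernel)
data = np.random.random((200, 200, 1535))
Gauss = np.ones(79) / 79.0          # stand-in for the real Gaussian kernel

full = fftconvolve_1d(data, Gauss)  # shape (200, 200, 1535 + 79 - 1)

# slice out the central part to match scipy's fftconvolve(..., mode='same')
start = (Gauss.shape[-1] - 1) // 2
same = full[..., start:start + data.shape[-1]]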


Further speedup can be achieved by using a different FFT back-end.

For one, the functions in scipy.fftpack appear to be somewhat faster than their Numpy equivalents. However, the output format of the Scipy variants is pretty awkward (see docs) and this makes it hard to do the multiplication.
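To illustrate what "awkward" means here (a small demonstration, not part of the original answer): scipy.fftpack.rfft returns a real array with the real and imaginary parts interleaved, so two spectra cannot simply be multiplied element-wise as complex numbers:

import numpy as np
from scipy import fftpack

x = np.arange(8.0)

# numpy: one complex number per frequency bin
print(np.fft.rfft(x))
# scipy.fftpack: packed real array [y(0), Re(y(1)), Im(y(1)), ..., Re(y(n/2))],
# so element-wise multiplication no longer performs complex multiplication
print(fftpack.rfft(x))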

Another possible backend is FFTW through the pyFFTW wrapper. Downsides are that transforms are preceded by a slow "planning phase" and that inputs have to be 16-byte aligned to achieve best performance. This is explained pretty well in the pyFFTW tutorial. The resulting code could be for example:

from scipy.signal.signaltools import _next_regular
import pyfftw
pyfftw.interfaces.cache.enable()  # Cache for the "planning"
pyfftw.interfaces.cache.set_keepalive_time(1.0)

def fftwconvolve_1d(in1, in2):

    outlen = in1.shape[-1] + in2.shape[-1] - 1
    n = _next_regular(outlen)

    tr1 = pyfftw.interfaces.numpy_fft.rfft(in1, n)
    tr2 = pyfftw.interfaces.numpy_fft.rfft(in2, n)

    sh = np.broadcast(tr1, tr2).shape
    dt = np.common_type(tr1, tr2)
    pr = pyfftw.n_byte_align_empty(sh, 16, dt)
    np.multiply(tr1, tr2, out=pr)
    out = pyfftw.interfaces.numpy_fft.irfft(pr, n)

    return out[..., :outlen].copy()

With aligned inputs and cached "planning" I saw a speedup of almost 3x over the code in the OP. Memory alignment can be easily checked by looking at the memory address in the ctypes.data attribute of a Numpy array.
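
As a quick check (a small sketch, assuming the same pyFFTW allocation helper used above): an array is 16-byte aligned when its start address is divisible by 16:

import numpy as np
import pyfftw

a = pyfftw.n_byte_align_empty(1024, 16, dtype='float64')
b = np.zeros(1024)

print(a.ctypes.data % 16)  # 0 -> 16-byte aligned
print(b.ctypes.data % 16)  # not guaranteed to be 0 for a plain numpy allocation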

Community
  • Replacing rfftn with rfft improved performance by about 30%. The pyFFTW method doesn't help, however: pyFFTW 6.3 s vs. numpy rfft 4.6 s; pyFFTW 86.1 s vs. numpy rfft 62.4 s. – user4547612 Aug 18 '15 at 19:42