I think you have already found the source code of the fftconvolve function. Normally, for real inputs, it uses the numpy.fft.rfftn and .irfftn functions, which compute N-dimensional transforms. Since the goal here is to do many 1-D transforms instead, you can basically rewrite fftconvolve like this (simplified):
    import numpy as np
    from scipy.signal.signaltools import _next_regular  # newer SciPy exposes this as scipy.fftpack.next_fast_len

    def fftconvolve_1d(in1, in2):
        outlen = in1.shape[-1] + in2.shape[-1] - 1  # length of the full linear convolution
        n = _next_regular(outlen)                   # pad up to a fast (5-smooth) FFT size
        tr1 = np.fft.rfft(in1, n)                   # 1-D transforms over the last axis
        tr2 = np.fft.rfft(in2, n)
        out = np.fft.irfft(tr1 * tr2, n)
        return out[..., :outlen].copy()
And calculate the desired result:

    result = fftconvolve_1d(data, Gauss)
This works because numpy.fft.rfft and .irfft (notice the lack of "n" in the names) transform over a single axis of the input array, the last axis by default. This is roughly 40% faster than the OP's code on my system.
Further speedup can be achieved by using a different FFT back-end. For one, the functions in scipy.fftpack appear to be somewhat faster than their NumPy equivalents. However, scipy.fftpack.rfft returns a plain real array with the real and imaginary parts of each coefficient packed into adjacent entries, rather than a complex array (see the docs), which makes the pointwise multiplication of the two spectra awkward.
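For example, comparing the two output conventions on a short signal:

    import numpy as np
    from scipy import fftpack

    x = np.array([1.0, 2.0, 3.0, 4.0])
    print(np.fft.rfft(x))    # [10.+0.j  -2.+2.j  -2.+0.j]  -- complex array
    print(fftpack.rfft(x))   # [10. -2.  2. -2.]  -- real/imag parts interleaved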
Another possible back-end is FFTW through the pyFFTW wrapper. Downsides are that transforms are preceded by a slow "planning" phase and that inputs have to be 16-byte aligned to achieve the best performance. This is explained pretty well in the pyFFTW tutorial. The resulting code could look like this, for example:
    import numpy as np
    import pyfftw
    from scipy.signal.signaltools import _next_regular

    # Cache the FFTW "plans" so repeated calls skip the planning phase.
    pyfftw.interfaces.cache.enable()
    pyfftw.interfaces.cache.set_keepalive_time(1.0)

    def fftwconvolve_1d(in1, in2):
        outlen = in1.shape[-1] + in2.shape[-1] - 1
        n = _next_regular(outlen)
        tr1 = pyfftw.interfaces.numpy_fft.rfft(in1, n)
        tr2 = pyfftw.interfaces.numpy_fft.rfft(in2, n)
        # Multiply into a 16-byte-aligned buffer so the inverse transform
        # gets aligned input (newer pyFFTW: pyfftw.empty_aligned(sh, dt, n=16)).
        sh = np.broadcast(tr1, tr2).shape
        dt = np.common_type(tr1, tr2)
        pr = pyfftw.n_byte_align_empty(sh, 16, dt)
        np.multiply(tr1, tr2, out=pr)
        out = pyfftw.interfaces.numpy_fft.irfft(pr, n)
        return out[..., :outlen].copy()
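It is called in the same way as before:

    result = fftwconvolve_1d(data, Gauss)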
With aligned inputs and cached "planning" I saw a speedup of almost 3x over the code in the OP. Memory alignment can be easily checked by looking at the memory address exposed by the ctypes.data attribute of a NumPy array.
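For instance (a minimal check; pyfftw.empty_aligned assumes a pyFFTW version where n_byte_align_empty has been renamed):

    import numpy as np
    import pyfftw

    a = pyfftw.empty_aligned((4, 1000), dtype='float64', n=16)
    a[:] = np.random.rand(4, 1000)
    print(a.ctypes.data % 16)   # 0 means the buffer starts on a 16-byte boundary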