
I would like to least-squares match data (a numpy array of floats) against many known signal shapes. My code works, but it is too slow for the many runs I plan to do:

import numpy
import time

samples = 50000
width_signal = 100
data = numpy.random.normal(0, 1, samples)
signal = numpy.random.normal(0, 1, width_signal)  # Placeholder

t0 = time.perf_counter()
for i in range(samples - width_signal):
    data_chunk = data[i:i + width_signal]
    residuals = data_chunk - signal
    squared_residuals = residuals**2
    summed_residuals = numpy.sum(squared_residuals)
t1 = time.perf_counter()
print('Time elapsed (sec)', t1-t0)

EDIT: Corrected a mistake: First square residuals, then sum them.

This takes about 0.2 sec to run on my machine. As I have many datasets and signal shapes, this is too slow. My specific problem does not allow for typical MCMC methods because the signal shapes are too different. It has to be brute force.

Typical volumes are 50,000 floats for the data and 100 for the signal. These can vary by a factor of a few.

My tests show that:

  • The summing of the squared residuals, numpy.sum(squared_residuals), eats about 90% of the time. I tried Python's built-in sum() and it is faster for small arrays (fewer than ~50 elements) but slower for bigger ones. Should I insert an if condition to switch between the two?
  • I tried numpy.roll() instead of fetching data directly, and .roll() is slower.

Questions:

  • Is there an algorithmic improvement that would speed this up?
  • Is there a faster way to sum the arrays? I don't know any C, but if it is much faster I could try that.
  • Can a GPU help? I have many runs to do. If so, where could I find a code snippet to do this?
Michael
  • You may have some luck with [`pandas.rolling`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html) but then the overhead of creating the DFs might start to dominate – roganjosh Aug 24 '18 at 10:20
  • Did you consider using a convolution instead of the sum of squared differences? Also, does the square have to be inside the sum? – xdze2 Aug 24 '18 at 10:33
  • Ah of course, I made a mistake: The squaring of the residuals must come first, then the summing of these squared residuals. I changed it in the Q. Is it possible to do this with np.lib.stride_tricks.as_strided? Thanks so much. – Michael Aug 24 '18 at 10:51
  • Should be `for i in range(samples - width_signal + 1):` to cover all. – Divakar Aug 24 '18 at 10:55
  • Regarding the +1 in length: In practice, I will have to roll over the edges. The real data is phase-folded, so the signal can begin near the end of the data and continue at the beginning. – Michael Aug 24 '18 at 11:06
  • There's no rolling. With `for i in range(samples - width_signal):` for `data[i:i + width_signal]` you are missing the last element in `data`. – Divakar Aug 24 '18 at 11:12
  • Agreed. I have only written it like this in the example to simplify the original question. In my real code, I have to consider rolling. I plan to roll it once (e.g., by 25%) and execute your code twice: on the original, and on the rolled variant. Is that optimal? Your approach #3 is again faster, thanks! – Michael Aug 24 '18 at 11:28
  • A GPU solution should also be not so difficult, but I have not much experience with that. Is float32 also enough? There is usually quite a large difference in performance on a GPU between float32 and float64. – max9111 Aug 24 '18 at 12:41
  • float32 is enough - I'd love to see a proof-of-concept for a GPU! – Michael Aug 24 '18 at 13:15
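
An editorial note on the wrap-around (phase-folded) case raised in the comments above: instead of rolling the data and running the search twice, one can pad the data with its first width_signal - 1 samples so that every cyclic start position appears exactly once. A minimal sketch, not from the original thread:

import numpy as np

# Pad with the first width_signal-1 samples so each cyclic window occurs once.
data_wrapped = np.concatenate((data, data[:width_signal - 1]))
# Any of the sliding-window approaches in the answers below, applied to
# data_wrapped, yields `samples` residual sums, one per cyclic offset.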
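
And for the GPU question: a minimal proof-of-concept sketch, not from the thread, assuming the CuPy package and a CUDA-capable GPU. It mirrors the brute-force windowed-residual idea in float32, as discussed above; the function name sq_residuals_gpu is a hypothetical placeholder:

import cupy as cp

def sq_residuals_gpu(data, signal):
    d = cp.asarray(data, dtype=cp.float32)   # float32 is enough per the comments
    s = cp.asarray(signal, dtype=cp.float32)
    n, w = d.size, s.size
    # Build all sliding windows with fancy indexing, then reduce per row on the GPU.
    idx = cp.arange(n - w + 1)[:, None] + cp.arange(w)[None, :]
    resid = d[idx] - s
    return cp.asnumpy(cp.einsum('ij,ij->i', resid, resid))

The (n - w + 1, w) window matrix costs about n*w*4 bytes in float32 (roughly 20 MB here), which fits easily on current GPUs; for much larger problems a convolution-based variant along the lines of Approach #3 below would be preferable.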

2 Answers


Based on the various methods proposed in Compute mean squared, absolute deviation and custom similarity measure - Python/NumPy, we can solve our case here.

Approach #1

We can leverage scikit-image's `view_as_windows`, which is based on `np.lib.stride_tricks.as_strided`, to get sliding windows and thus have our first solution, like so -

import numpy as np
from skimage.util import view_as_windows

d = view_as_windows(data, width_signal) - signal   # residuals for every window position
out = np.einsum('ij,ij->i', d, d)                  # row-wise sum of squared residuals

More info on the use of `as_strided`-based `view_as_windows` is given in the post linked above.

Approach #2

Again based on the matrix-multiplication trick in that answer post, we can improve the performance, like so -

def MSD_strided(data, signal):
    # per-window sum of (a - b)**2, expanded as sum(a**2) + sum(b**2) - 2*sum(a*b)
    w = view_as_windows(data, len(signal))
    return (w**2).sum(1) + (signal**2).sum(0) - 2*w.dot(signal)
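
For clarity (this step is not spelled out in the original answer), Approaches #2 and #3 rest on expanding the summed squared residual of the i-th window:

$$\sum_{j} (w_{ij} - s_j)^2 = \sum_{j} w_{ij}^2 + \sum_{j} s_j^2 - 2\sum_{j} w_{ij} s_j,$$

where $w_{ij}$ is the j-th sample of the i-th window and $s$ is the signal. The first term is a rolling sum of squares, the second is a constant, and the third is a sliding dot product: that is what `w.dot(signal)` computes above and what the convolution computes in Approach #3 below.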

Approach #3

We will improve on Approach #2 by bringing in uniform filtering and convolution -

from scipy.ndimage import uniform_filter

def MSD_uniffilt_conv(data, signal):
    w = len(signal)
    hW = w//2
    l = len(data) - w + 1
    # windowed sum of data**2 via a uniform (moving-average) filter, rescaled by w
    part1 = uniform_filter(data**2, w)[hW:hW+l]*w
    # per-window cross term sum(data*signal) via convolution with the reversed signal
    part3 = np.convolve(data, signal[::-1], 'valid')
    return part1 + (signal**2).sum(0) - 2*part3
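
As a quick sanity check (not part of the original answer), all three approaches can be compared against the brute-force loop from the question:

ref = np.array([((data[i:i + width_signal] - signal)**2).sum()
                for i in range(samples - width_signal + 1)])

d = view_as_windows(data, width_signal) - signal
print(np.allclose(np.einsum('ij,ij->i', d, d), ref))      # expect True
print(np.allclose(MSD_strided(data, signal), ref))        # expect True
print(np.allclose(MSD_uniffilt_conv(data, signal), ref))  # expect True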

Benchmarking

Timings on posted sample -

In [117]: %%timeit
     ...: for i in range(samples - width_signal + 1):
     ...:     data_chunk = data[i:i + width_signal]
     ...:     residuals = data_chunk - signal
     ...:     squared_residuals = residuals**2
     ...:     summed_residuals = numpy.sum(squared_residuals)
1 loop, best of 3: 239 ms per loop

In [118]: %%timeit
     ...: d = view_as_windows(data,(width_signal))-signal
     ...: np.einsum('ij,ij->i',d,d)
100 loops, best of 3: 11.1 ms per loop

In [209]: %timeit MSD_strided(data, signal)
10 loops, best of 3: 18.4 ms per loop

In [210]: %timeit MSD_uniffilt_conv(data, signal)
1000 loops, best of 3: 1.71 ms per loop

~140x speedup there with the third one!

Divakar
  • I did quite a bit of searching for a moving window in numpy and this approach didn't appear anywhere. Ok, it's not part of numpy but it's a lot better than what I was finding. Should I have been searching for something other than "numpy moving window" or should it be added as an answer to those threads? – roganjosh Aug 24 '18 at 10:30
  • @roganjosh You would find a lot of `np.lib.stride_tricks.as_strided` around on SO. `scikit-image's view_as_windows` is a recent addition to `scikit-image`, so that might be why it's not around much, but I find it pretty neat as compared to tinkering around with the strides and shapes arg for `as_strided` one. Hence, I have been using this `view_as_windows` quite regularly whenever I need to work with sliding windows. As for the term - "NumPy windows" could be good keywords. – Divakar Aug 24 '18 at 10:33
  • Yep, I did see `np.lib.stride_tricks.as_strided` actually, but I agree that this is a neat way for it being packaged up :) – roganjosh Aug 24 '18 at 10:34
  • Wow. That last approach in your edit is pretty amazing. – Daniel F Aug 24 '18 at 12:42

Apart from the versions given by Divakar, you could simply use a compiler like Numba or Cython.

Example

import numpy as np
import numba as nb

@nb.njit(fastmath=True, parallel=True)
def sq_residuals(data, signal):
    n_windows = data.shape[0] - signal.shape[0] + 1
    summed_residuals = np.empty(n_windows, dtype=data.dtype)
    # parallel loop over all window start positions
    for i in nb.prange(n_windows):
        acc = 0.
        for j in range(signal.shape[0]):
            acc += (data[i + j] - signal[j])**2
        summed_residuals[i] = acc
    return summed_residuals
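
A minimal usage sketch (not part of the original answer), reusing `data` and `signal` from the question; the first call triggers JIT compilation, so time a later call:

import time

sq_residuals(data, signal)                     # first call: JIT compilation
t0 = time.perf_counter()
summed_residuals = sq_residuals(data, signal)  # subsequent calls are fast
print('Time elapsed (sec)', time.perf_counter() - t0)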

Timings

Numba 0.4dev, Python 3.6, quad-core i7

MSD_uniffilt_conv (Divakar): 2.4 ms
sq_residuals (after the first call, which incurs the compilation overhead): 1.7 ms
max9111