
My task is fairly simple: I have a large 2D matrix, containing only zeros and ones. For each position in this matrix, I want to sum all pixels in a window around this position. The problem is that the matrix has the shape (166667, 17668) and window sizes range from (333, 333) to (5333, 5333). So far I have only tried on a subset of the data. The code I arrived at:

import numpy as np

out_arr = np.zeros(in_arr.shape, dtype=np.int64)  # one window sum per input position
in_arr = np.pad(in_arr, windowsize//2, mode='reflect')
for y in range(out_arr.shape[0]):
    for x in range(out_arr.shape[1]):
        out_arr[y, x] = np.sum(in_arr[y:y+windowsize, x:x+windowsize])

Obviously, this takes a long time. But in my case it was faster than a rolling window approach using numpy.stride_tricks.as_strided, as described here. I also tried compiling it with Cython, to no effect.
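
For reference, a rolling-window version of the loop might look roughly like the sketch below. It uses numpy.lib.stride_tricks.sliding_window_view (NumPy >= 1.20) rather than raw as_strided and is not the exact code from the linked answer; it does the same O(n * windowsize^2) work, just without the explicit Python loops.

from numpy.lib.stride_tricks import sliding_window_view

padded = np.pad(in_arr, windowsize//2, mode='reflect')
windows = sliding_window_view(padded, (windowsize, windowsize))  # a view, no copy is made
out_arr = windows.sum(axis=(2, 3))                               # still reads windowsize**2 values per output pixel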

  1. What would be your suggestions to speed this up, apart from parallelizing?
  2. I have an Nvidia Titan X at hand. Is there a way to benefit from that (e.g. using cupy)?
MonsterMax
  • Try 2D convolution? – Divakar Mar 27 '18 at 12:58
  • I don't see how this can be faster than a rolling window approach. – MB-F Mar 27 '18 at 13:16
  • @kazemakase just more elegant and easy to read. – mr_mo Mar 27 '18 at 13:22
  • @mr_mo I referred to the OP's loopy approach. Convolution is fine :) – MB-F Mar 27 '18 at 13:26
  • Today it also dawned on me that I'm actually doing a convolution. Surprisingly, my approach is still faster than scipy convolution. Given a (333,333) window and a (1000,1000) matrix, it takes 56s vs 80s with scipy conv, all on one cpu. I then implemented a convolution using tensorflow and the gpu; it takes 46s. But with multithreading I boiled my posted solution down to 10s. – MonsterMax Mar 28 '18 at 15:54
  • @MonsterMax You must be using an old version of scipy. Recent versions automatically choose to use `fft` based convolution for `1000 x 1000` and `333 x 333` which on my modest laptop clocks in at `300` ms. On older versions of scipy you can force this behavior using `fftconvolve`. Any reason you didn't try my `cumsum` based answer? If I may say so myself it wipes the floor even with `fftconvolve`: `<75` ms for the same data and window size. – Paul Panzer Mar 29 '18 at 04:59

2 Answers


For windowed summation, convolution is actually overkill, since a simple O(n) solution exists:

import numpy as np
from scipy.signal import convolve

def winsum(in_arr, windowsize):
    # pad by windowsize//2 + 1 so that zeroing the first row/column below
    # does not touch real data; drop the superfluous last row/column
    in_arr = np.pad(in_arr, windowsize//2 + 1, mode='reflect')[:-1, :-1]
    in_arr[0] = 0
    in_arr[:, 0] = 0
    # 2D prefix sums (summed-area table)
    ps = in_arr.cumsum(0).cumsum(1)
    # each window sum is a four-corner combination of the prefix sums
    return ps[windowsize:, windowsize:] + ps[:-windowsize, :-windowsize] \
           - ps[windowsize:, :-windowsize] - ps[:-windowsize, windowsize:]

This is already fast, but you can save even more: ps, calculated once for the largest window size, can be reused for all smaller window sizes.
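
Here is a rough sketch of how that sharing could look, assuming odd window sizes as in the question (make_ps and winsum_from_ps are illustrative helper names, not part of the code above):

def make_ps(in_arr, maxwin):
    # prefix sums ("summed-area table"), padded once for the largest window size
    a = np.pad(in_arr, maxwin//2 + 1, mode='reflect')[:-1, :-1]
    a[0] = 0
    a[:, 0] = 0
    return a.cumsum(0).cumsum(1)

def winsum_from_ps(ps, maxwin, windowsize, out_shape):
    # windowed sums for any odd windowsize <= maxwin, reusing the same ps
    H, W = out_shape
    d0 = maxwin//2 - windowsize//2        # "top-left minus one" corner offset
    d1 = maxwin//2 + 1 + windowsize//2    # "bottom-right" corner offset
    return (ps[d1:d1+H, d1:d1+W] + ps[d0:d0+H, d0:d0+W]
            - ps[d0:d0+H, d1:d1+W] - ps[d1:d1+H, d0:d0+W])

For example, ps = make_ps(data, 333) followed by winsum_from_ps(ps, 333, 99, data.shape) should reproduce winsum(data, 99) while paying for the cumulative sums only once.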

However, there is one potential drawback: the very large numbers that can arise from summing everything like that. A numerically more sound version eliminates this problem by taking the differences first, so the running sums never span more than one window along each axis. Downside: the extra saving through sharing ps is no longer available.

def winsum_safe(in_arr, windowsize):
    in_arr = np.pad(in_arr, windowsize//2, mode='reflect')
    # difference with a lag of windowsize along each axis first ...
    in_arr[windowsize:] -= in_arr[:-windowsize]
    in_arr[:, windowsize:] -= in_arr[:, :-windowsize]
    # ... so that the cumulative sums below collapse into sliding window sums
    return in_arr.cumsum(0)[windowsize-1:].cumsum(1)[:, windowsize-1:]

For reference, here is the closest competitor, which is FFT-based convolution. You need an up-to-date version of scipy for this to work efficiently; on older versions, use fftconvolve instead of convolve.

def winsumc(in_arr, windowsize):
    in_arr = np.pad(in_arr, windowsize//2, mode='reflect')
    # convolving with an all-ones kernel is the same as a windowed sum
    kernel = np.ones((windowsize, windowsize), in_arr.dtype)
    return convolve(in_arr, kernel, 'valid')

The next one simulates scipy's old, and excruciatingly slow, behavior by forcing direct convolution.

def winsum_nofft(in_arr, windowsize):
    in_arr = np.pad(in_arr, windowsize//2, mode='reflect')
    kernel = np.ones((windowsize, windowsize), in_arr.dtype)
    return convolve(in_arr, kernel, 'valid', method='direct')

Testing and benchmarking:

data = np.random.random((1000, 1000))

assert np.allclose(winsum(data, 333), winsumc(data, 333))
assert np.allclose(winsum(data, 333), winsum_safe(data, 333))

kwds = dict(globals=globals(), number=10)  # timeit returns total seconds for 10 runs, hence *100 below to get ms per run

from timeit import timeit
from time import perf_counter

print('data 1000x1000, window 333x333')
print('cumsum:      ', timeit('winsum(data, 333)', **kwds)*100, 'ms')
print('cumsum safe: ', timeit('winsum_safe(data, 333)', **kwds)*100, 'ms')
print('fftconv:     ', timeit('winsumc(data, 333)', **kwds)*100, 'ms')


t = perf_counter()
res = winsum_nofft(data, 99) # 333 just takes too long
t = perf_counter() - t

assert np.allclose(winsum(data, 99), res)

print('data 1000x1000, window 99x99')
print('conv:        ', t*1000, 'ms')

Sample output:

data 1000x1000, window 333x333
cumsum:       70.33260859316215 ms
cumsum safe:  59.98647050000727 ms
fftconv:      298.60571819590405 ms
data 1000x1000, window 99x99
conv:         135224.8261970235 ms
Paul Panzer
  • Today I found the time to test your solution. I opted for winsum_safe, as I only have a few different window sizes to calculate. It has smashed all the others - thank you! Smarter every day :) Is there a similar solution for circular filter kernels, or would one have to fall back to real convolutions for that? – MonsterMax Mar 29 '18 at 10:07
  • As long as the kernel is just a rectangle of ones you should be able to simply use `pad` with `mode='wrap'` instead of `'reflect'`. – Paul Panzer Mar 29 '18 at 10:43

@Divakar pointed out in the comments that you can use a 2D convolution (scipy.signal.convolve2d), and he is right. Here is an example:

import numpy as np
from scipy import signal

data = np.random.rand(5,5) # your original data that you want to sum
kernel = np.ones((2,2)) # square matrix with your window's dimensions, filled with ones
output = signal.convolve2d(data,kernel,mode='same') # the convolution
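
Note that mode='same' zero-pads the border by default, whereas the question pads with mode='reflect'. If you want edge handling closer to that, convolve2d also takes a boundary argument; for example (scipy's 'symm' is symmetric padding, similar to, but not identical to, np.pad's 'reflect'):

output = signal.convolve2d(data, kernel, mode='same', boundary='symm')  # mirror the data at the edges instead of zero-filling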
mr_mo