Accumulate sliding windows relative to origin

Question

I have an array A with the shape (3,3) which can be thought of as the sliding window view of an unkown array with the shape (5,). I want to compute the inverse of windowing the array with the shape (5,). The adjoint operation of this will be summation. What I mean is that I want to accumulate the values in each corresponding window with the related position in the array with the shape (5,). Ofcourse, my expected output of this inverse function and the input A are not related and are just ordinary arrays. I have two examples which I hope explains this better.

A = np.array([[0, 0, 1],
              [0, 0, 1],
              [0, 0, 1]], dtype=np.float32)

I expect this output:

np.array([0, 0, 1, 1, 1])

The other example:

A = np.array([[1, 2, 3],
              [2, 3, 4],
              [3, 4, 5]], dtype=np.float32)

I expect this output:

np.array([1, 2+2, 3+3+3, 4+4, 5]) = np.array([1, 4, 9, 8, 5])

The solution I have which is quite slow (result stored in out)

out = np.zeros(5, dtype=np.float32)
windows = np.lib.stride_tricks.as_strided(out, shape=(3,3), strides=(4,4))
for i in np.ndindex(windows.shape):
  windows[i] += A[i]

Writing to a strided view feels a bit hacky and I am sure there is a better solution.

Is there any way to write this in a vectorized manner, without the for-loop? (which also generalizes for multiple dimensions)

EDIT

In terms of generalizing for higher dimensions, I have cases where the windows are taken from an image (2d array), instead of a 1d array like the example above. For the 2d case, A can for example be windows of size 3. This means that from an image (output) with the shape (4,4), The windows A will have the shape (2,2,3,3).

A = np.array([[[[0, 0, 0],
                [0, 1, 0],
                [0, 0, 0]],

               [[0, 0, 0],
                [1, 0, 0],
                [0, 0, 0]]],


              [[[0, 1, 0],
                [0, 0, 0],
                [0, 0, 0]],

               [[1, 0, 0],
                [0, 0, 0],
                [0, 0, 0]]]], dtype=np.float32)

Using the solution given by Pablo, I get the following error

value array of shape (2,2,3,3)  could not be broadcast to indexing result of shape (2,2)

Using a slightly modified version of my stride solution:

def inverse_sliding_windows(A, window_sz, image_sz):
  out = np.zeros(image_sz, dtype=np.float32)
  windows = np.lib.stride_tricks.sliding_window_view(out, window_sz, writeable=True)
  for i in np.ndindex(windows.shape):
    windows[i] += A[i]

window_sz = (3,3)
image_sz = (4,4)
inverse_sliding_windows(A, window_sz, image_sz)

Output:

array([[0., 0., 0., 0.],
       [0., 4., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]], dtype=float32)

To clarify, the window size and output shape is known beforehand, see inverse_sliding_windows.

Are we talking, in the general case, *nxn* array and output `2*n-1`? — Quang Hoang, Apr 07 '21 at 02:43
For simplicity, the windows are equally dimensioned. So yes, the array with all the windows ```A``` will be an *nxn* array. This implies that the output will have the size ```2*n-1``` for the 1d case. — Kevin, Apr 07 '21 at 02:49
For the 2d case with windows of size ```3```, I would expect it to go from a shape ```(2,2,3,3)``` to the shape ```(4,4)```. — Kevin, Apr 07 '21 at 03:00
is this equivalent to rotate the matrix -45 degrees and sum row-wise? — Pablo C, Apr 07 '21 at 04:47
@PabloC You solution looks correct for all 1d cases. But for the case in the post I just edited, I want to rotate it in 4 dimensions if that makes sense. Fixing it just for when the windows ```A``` are 2d and 4d would be great. — Kevin, Apr 07 '21 at 12:33
A vectorized version doesn't always guarantee better running time. The matrix rotation approach is very inefficient. I do have a solution for 2D case using a for loop and I believe it's much faster than matrix rotation and other naively vectorized ones. Saying that, you might have to decide which one is more important - vectorization or running time. If the generalization for higher dimension isn't necessary, I'll post it here. Otherwise, could you simplify your question for higher dimension? — Shihao Xu, Apr 14 '21 at 08:34
I've just compared the rotation approach and my code (which uses a for loop) on a 4000x4000 matrix. My approach: Wall time: 38.3 ms. Rotation: Wall time: 751 ms. — Shihao Xu, Apr 14 '21 at 08:47
@ShihaoXu I posted an answer using your solution, this might give you some idea of how it could be generalized for higher dimensions. I hope this is sufficient, otherwise I will edit my post with another (simpler?) example. If you do come up with a smart solution, I'll delete my posted answer. — Kevin, Apr 14 '21 at 12:54
@Kevin I've just noticed this comment. The calculation of shapes is still unclear for me. But I think my updated solution should give you an entry point to modify. — Shihao Xu, Apr 14 '21 at 16:51
@ShihaoXu I think I got an idea by looking at your solution. Thanks for the help. — Kevin, Apr 14 '21 at 17:26

Shihao Xu · Accepted Answer · 2021-05-02T01:10:39.373

As I mentioned in the comment, a vectorized solution doesn't always guarantee a better running time. If your matrix is large, you might prefer more efficient methods. And there are several reasons why matrix rotation is slow (though, intuitive), see comment.

Performance comparison:

Solution: Wall time: 61.6 ms
Rotation: Wall time: 3.32 s

Code (tested in jupyter notebook)

import numpy as np

def rotate45_and_sum(A):
    n = len(A) 
    x, y = np.meshgrid(np.arange(n), np.arange(n))  # at least doubled the running time
    xn, yn = x + y, n - x + y - 1   # generating xn and yn at least doubled the running time
    M = np.zeros((2*n -1, 2*n -1))  # at least slows down running time by a factor of 4
    M[xn,yn] = A[x,y] # very inefficient indexing strategy
    return M.sum(1)

def solution(A):
    n = A.shape[0]
    retval = np.zeros(2*n-1)
    for i in range(n):
        retval[i:(i+n)] += A[i, :]
    return retval

A = np.random.randn(10000, 10000)

%time solution(A)

%time rotate45_and_sum(A)

In multidimensional situation:

def solution(A):
    h,w,x,y = A.shape                # change here
    retval = np.zeros((2*x-w,2*y-h)) # change here
    indices = np.ndindex(w, h)       # change here
    for index in indices:
        slices = tuple()
        for i in range(len(index)):
            slices = slices + (slice(index[i], index[i]+x),) # I assume x = y = ..., you need to change here also if the assumption is not correct
        retval[slices] += A[index] # slices is roughly equal `i:(i+x), j:(j+y)` in your code
    return retval

Actually I don't know the how the dimensions (or shapes) are calculated based on your description :(. But i think it could be generalized. The idea is to construct slices as you go. So you need to specify which dimensions correspond to h, w, which correspond to x, y. I think it's not difficult to do that.

Reference: Numpy index array of unknown dimensions?

Regarding https://stackoverflow.com/a/67341994/14923227


def fast(A):
    n = A.shape[0]
    retval = np.zeros(2*n-1)
    for i in range(n):
        retval[i:(i+n)] += A[i, :]
    print(retval.sum())
    return retval

##########################
import threading

class sumThread(threading.Thread):
    def __init__(self, A, mat, threadID, ngroups, size):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.size = size
        self.ngroups = ngroups
        self.mat = mat
        self.A = A
    def run(self):
        begin = (self.size + self.ngroups) // self.ngroups * self.threadID
        end   = min(self.size, (self.size+self.ngroups)//self.ngroups*(self.threadID+1))
        for i in range(begin, end):
            self.mat[self.threadID, i:(i+self.size)] += self.A[i, :]

def faster(A):
    
    num_threads = max(1, A.shape[0] // 4000) 
    mat = np.zeros((num_threads, 2*A.shape[0]-1))
    threads = []
    for i in range(num_threads):
        t = sumThread(A, mat, i, num_threads, A.shape[0])
        t.start()
        threads.append(t)

    # Wait for all threads to complete
    for t in threads:
        t.join()
    return np.sum(mat, axis=0)

Performance for large array:

A = np.random.randn(20000,20000)
%timeit fast(A)   # 263 ms ± 5.21 ms per loop 
%timeit faster(A) # 155 ms ± 3.14 ms per loop

It's trivial to parallelize the for loop in fast. But fast is actually the most cache efficient (even for GPU cache and memory banks) and thus the fastest way to compute it. Ideally, you can parallelize the code with CUDA/OpenCL since there are way more cores in a GPU. If you do it correctly, the running time will be reduced to log(original_fast_time) with base k, where k is the number of cores you have.

However, there are only a few computations in the function. So the transportation of data between memory and GRAM might dominate. (I didn't test it)

Thanks for this! I did not know that writing it non-vectorized would be faster. Ultimately, I don't care if the solution is vectorized or not - only performance and memory consumption is important. — Kevin, Apr 14 '21 at 12:51
Your're welcome. Many numpy's functions are just syntax sugar. It helps to consider how it is done behind the scene (by the C function which does the actual work). Sometimes a properly vectorized solution can speedup the computation, but it is not clear whether the code you write is "proper" enough. In either case, profiling might be the only tool you can rely on. — Shihao Xu, Apr 14 '21 at 15:37
here is an interesting runtime analysis for another very similar question I posted: https://stackoverflow.com/a/67341994/14923227 — Kevin, May 01 '21 at 14:34
@Kevin I've updated my answer here. It's trivial but might be helpful if you need more speed. — Shihao Xu, May 01 '21 at 23:56
I posted an answer here for using a compiled version of your ```fast``` solution. Would it be possible to speed up it using a parallelized GPU solution? (ignoring the overhead given by the transportation of data between CPU and GPU) — Kevin, May 31 '21 at 23:48
@Kevin I haven't called GPU functionalities directly with `numpy` before (only with CUDA). It's beyond my knowledge. But I think if you rewrite it with pure CUDA, it will be definitely faster neglecting transfer overhead. — Shihao Xu, Jun 02 '21 at 17:54

score 1 · Answer 2 · answered Apr 07 '21 at 05:34

IIUC the problem proposed here is equivalent to rotate matrix A by -45 degrees and sum row-wise (at least for the 2D version). For a better understanding of what I mean by rotating the matrix, see this post.

def rotate45_and_sum(A):
    n = len(A) 
    x, y = np.meshgrid(np.arange(n), np.arange(n)) 
    xn, yn = x + y, n - x + y - 1
    M = np.zeros((2*n -1, 2*n -1)) 
    M[xn,yn] = A[x,y] 
    return M.sum(1)

A = np.array([[0, 0, 1],
              [0, 0, 1],
              [0, 0, 1]], dtype=np.float32)

print(rotate45_and_sum(A))
#[0. 0. 1. 1. 1.]

A = np.array([[1, 2, 3],
              [2, 3, 4],
              [3, 4, 5]], dtype=np.float32)

print(rotate45_and_sum(A))
#[1. 4. 9. 8. 5.]

M is the rotated matrix.

Disclaimer: I don't know if this can be generalized for multiple dimensions

This is really neat, thank you. I made an edit to my post, do you think it is possible to do the same thing for ```A``` with 4 dimensions? — Kevin, Apr 07 '21 at 12:35

score 1 · Answer 3 · answered May 31 '21 at 23:37

Expanding on the fast solution given by @Shihao Xu, I tried translating it to compilable c-code by adding a np.fast_compiled function inside numpy/core/src/multiarray:

NPY_NO_EXPORT PyObject *
arr_fast_compiled(PyObject *NPY_UNUSED(self), PyObject *args)
{
    PyObject *list_obj = NULL;
    PyArrayObject *list_arr = NULL, *ans = NULL;

    npy_intp len, ans_size;
    npy_intp i, j, k;
    double *dans, *numbers;

    if (!PyArg_ParseTuple(args, "O", &list_obj)) {
            goto fail;
    }

    list_arr = (PyArrayObject *)PyArray_ContiguousFromAny(list_obj, NPY_DOUBLE, 2, 2);
    if (list_arr == NULL) {
        goto fail;
    }

    len = PyArray_DIM(list_arr, 0);
    numbers = (double *)PyArray_DATA(list_arr);
    ans_size = 2*len-1;

    ans = (PyArrayObject *)PyArray_ZEROS(1, &ans_size, NPY_DOUBLE, 0);
    if (ans == NULL) {
        goto fail;
    }
    dans = (double *)PyArray_DATA(ans);
    NPY_BEGIN_ALLOW_THREADS;
    for (i = 0; i < len; ++i) {
        k = i * len;
        for (j = i; j < i + len; ++j, ++k) {
            dans[j] += numbers[k];
        }
    }
    NPY_END_ALLOW_THREADS;
    Py_DECREF(list_arr);
    return (PyObject *)ans;

fail:
    Py_XDECREF(list_arr);
    Py_XDECREF(ans);
    return NULL;
}

The for-loop is the most important:

for (i = 0; i < len; ++i) {
    k = i * len;
    for (j = i; j < i + len; ++j, ++k) {   
        dans[j] += numbers[k];
    }
}

numbers is the input argument (A), and we access the elements in numbers and dans in a strided fashion. In the 3x3 example I had we have the following j and k values:

j = [0, 1, 2, 1, 2, 3, 2, 3, 4]
k = [0, 1, 2, 3, 4, 5, 6, 7, 8]

The NPY_BEGIN_ALLOW_THREADS is something I saw frequently used for other numpy functions, but it seems to make no performance difference when I tested it without.

The performance is similar to just_sum_0

Accumulate sliding windows relative to origin

3 Answers3

Linked