Vectorized way to add multiple overlapping cubes

Question

I'm using sliding windows to do deep learning on a large rectangular image. The image has shape (height, width).

The prediction output is an ndarray of shape (height, width, prediction_probability). My predictions are output in overlapping windows which I need to add together to get pixel by pixel predictions for the entire input image. The windows overlap by more than 2 pixels in the (height, width).

In C++ previously I've done stuff like this, creating a large result index and then adding all the ROIs together.

#include <opencv2/core/core.hpp>
using namespace std;  

template <int numberOfChannels>
static void AddBlobToBoard(Mat& board, vector<float> blobData,
                                            int blobWidth, int blobHeight,
                                            Rect roi) {
 for (int y = roi.y; y < roi.y + roi.height; y++) {
    auto vecPtr = board.ptr< Vec <float, numberOfChannels> >(y);
    for (int x = roi.x; x < roi.x + roi.width; x++) {
      for (int channel = 0; channel < numberOfChannels; channel++) {
          vecPtr[x][channel] +=
            blobData[(band * blobHeight + y - roi.y) * blobWidth + x - roi.x];}}}

Is there a vectorized way of doing this in Python?

Voting to close as Too Broad. Use numpy ndarrays to hold the image: there are a number of implementations for sliding windows using Numpy as_strided stride tricks that can be found with a search. Scikit has an extract patches method/function - it used to use as_strided - don't know if it still does. — wwii, Aug 27 '18 at 17:27
@wwii thanks for commenting on your downvote. What can I do to make the question less broad? I'm not asking about all methods of adding multiple rectangles, I'm asking for the fastest method. The documentation at https://docs.scipy.org/doc/numpy/reference/generated/numpy.lib.stride_tricks.as_strided.html states that "it is advisable to avoid as_strided when possible" Once again, thanks for your help. — empty, Aug 27 '18 at 17:46
Are you writing back to the image after processing or just extracting data, processing it, and using the processed data elsewhere? — wwii, Aug 27 '18 at 18:35
Here are some non-stride_trcks solutions. https://stackoverflow.com/a/4947453/2823755, https://stackoverflow.com/a/42258242/2823755, https://stackoverflow.com/a/10906015/2823755 — wwii, Aug 27 '18 at 18:47
@wwii the examples are for the creation of sliding windows, which I already have. I'm looking for ways to stitch the overlapping windows together again. — empty, Aug 27 '18 at 21:02
It's not overly broad, and it's not code review. It's looking for a functionality that `as_strided` doesn't currently have - namely vectorized writing. If you try to read back to an `as_strided` view as done above you'll get a race condition as the parallel writes will all be pointing to the same memory block. Probably the best solution is to build a `numba` function with parellel functionality to the `c++` code. There's no real way to do this that I see without using `for` loops. And that's going to be slow without `numba` — Daniel F, Aug 28 '18 at 07:22
@Kevin, you might want to edit in a blurb to the end of your question similar to my coment above before this gets closed by the reviewers. And try to avoid words like "fastest" or "best" in questions as they are just close vote bait. You want a fast way to do something that currently needs a `for` loop. Ask for a "vectorized" solution (you won't get it for this question, but you also would avoid the close votes). — Daniel F, Aug 28 '18 at 07:24
not my question @DanielF. Are you sure there's not a vectorized solution here? I'm imagining a dim=(n,m) input array generating a dim=(n,m,k,k) windows array, then stacking windows into their proper positions in an (n,m,k**2) output array -- each window gets its own [:,:,i] slice-- before summing over the last axis. — kevinkayaks, Aug 28 '18 at 07:31
Also check [`this`](https://stackoverflow.com/questions/45685772/will-results-of-numpy-as-strided-depend-on-input-dtype/45812234#45812234) out for why someone might say "it is advisable to avoid `as_strided` when possible". — Daniel F, Aug 28 '18 at 07:31
I know, @kevinkayaks, the questioner is also named Kevin. :P — Daniel F, Aug 28 '18 at 07:33
There is probably a vectorized solution, but it would likely cause memory errors as you would have to basically do a `.copy()` on the `as_strided` view. — Daniel F, Aug 28 '18 at 07:36
Edited to keep the review queue from closing. Feel free to revert if I've misunderstood the question. — Daniel F, Aug 28 '18 at 07:53
@kevinkayaks earlier the question was "what is the fastest way ..." That would be a matter of opinion and resources (cpu/gpu/cluster ...). The question is now a SO question changing it to "vectorized implementation" — Back2Basics, Aug 28 '18 at 23:26

kevinkayaks · Accepted Answer · 2018-08-28T16:58:47.743

Edit:

@Kevin IMO if you're training a network anyway, you should do this step with a fully connected layer. that said..

I have a non-vectorized solution if you want something to work with. Any solution will be memory intensive. On my laptop it works sorta fast for CIFAR sized gray images (32x32). Maybe the key step could be vectorized by someone clever.

First split a test array arr into windows win using skimage. This is test data.

>>> import numpy as np
>>> from skimage.util.shape import view_as_windows as viewW
>>> arr = np.arange(20).reshape(5,4)
>>> win = viewW(arr, (3,3))
>>> arr # test data 
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19]])
>>> win[0,0]==arr[:3,:3] # it works. 
array([[ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True]])

Now to recombine, generate an output array out with shape (5,4,6). 6 is the number of windows in win and (5,4) is arr.shape. We will populate this array by one window in each slice along the -1 axis.

# the array to be filled
out = np.zeros((5,4,6)) # shape of original arr stacked to the number of windows 

# now make the set of indices of the window corners in arr 
inds = np.indices((3,2)).T.reshape(3*2,2)

# and generate a list of slices. each selects the position of one window in out 
slices = [np.s_[i[0]:i[0]+3:1,i[1]:i[1]+3:1,j] for i,j in zip(inds,range(6))] 

# this will be the slow part. You have to loop through the slices. 
# does anyone know a vectorized way to do this? 
for (ii,jj),slc in zip(inds,slices):
    out[slices] = win[ii,jj,:,:]

Now the out array contains all of the windows in their proper positions but separated into panes across the -1 axis. To extract your original array you can average all elements down this axis which do not contain zeros.

>>> out = np.true_divide(out.sum(-1),(out!=0).sum(-1))
>>> # this can't handle scenario where all elements in an out[i,i,:] are 0
>>> # so set nan to zero 

>>> out = np.nan_to_num(out)
>>> out 
array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.],
       [12., 13., 14., 15.],
       [16., 17., 18., 19.]])

Can you think up a way to operate over an array of slices in a vectorized way?

All together:

def from_windows(win):
    """takes in an arrays of windows win and returns the original array from which they come"""
    a0,b0,w,w = win.shape # shape of window
    a,b = a0+w-1,b0+w-1 # a,b are shape of original image 
    n = a*b # number of windows 
    out = np.zeros((a,b,n)) # empty output to be summed over last axis 
    inds = np.indices((a0,b0)).T.reshape(a0*b0,2) # indices of window corners into out 
    slices = [np.s_[i[0]:i[0]+3:1,i[1]:i[1]+3:1,j] for i,j in zip(inds,range(n))] # make em slices 
    for (ii,jj),slc in zip(inds,slices): # do the replacement into out 
        out[slc] = win[ii,jj,:,:]
    out = np.true_divide(out.sum(-1),(out!=0).sum(-1)) # average over all nonzeros 
    out = np.nan_to_num(out) # replace any nans remnant from np.alltrue(out[i,i,:]==0) scenario
    return out # hope you've got ram

and the test:

>>> arr = np.arange(32**2).reshape(32,32)
>>> win = viewW(arr, (3,3))
>>> np.alltrue(arr==from_windows(win))
True
>>> %timeit from_windows(win)
6.3 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Practically speaking this is not going to be fast enough for you to train on

Vectorized way to add multiple overlapping cubes

1 Answers1