
Let's assume we have the following NumPy 4D array of shape (3, 1, 4, 4):

import numpy as np
n, c, h, w = 3, 1, 4, 4

data = np.arange(n * c * h * w).reshape(n, c, h, w)

>>> array([[[[ 0,  1,  2,  3],
             [ 4,  5,  6,  7],
             [ 8,  9, 10, 11],
             [12, 13, 14, 15]]],

           [[[16, 17, 18, 19],
             [20, 21, 22, 23],
             [24, 25, 26, 27],
             [28, 29, 30, 31]]],

           [[[32, 33, 34, 35],
             [36, 37, 38, 39],
             [40, 41, 42, 43],
             [44, 45, 46, 47]]]])

Now I want to crop each of the n subarrays at a different location, all with the same size:

size = 2
locations = np.array([
    [0, 1],
    [1, 1],
    [0, 2]
])

The slow way to do this is the following:

crops = np.stack([d[:, y:y+size, x:x+size]
                  for d, (y, x) in zip(data, locations)])

>>> array([[[[ 1,  2],
             [ 5,  6]]],

           [[[21, 22],
             [25, 26]]],

           [[[34, 35],
             [38, 39]]]])

Now I'm looking for a way to achieve this with NumPy's fancy indexing. I have already spent hours trying to figure out how to solve this problem. Am I overlooking a simple way to solve it? Are there some NumPy indexing experts out there who can help me?

DiKorsch

2 Answers


We can extend this solution to your 4D case by leveraging np.lib.stride_tricks.as_strided-based sliding-window views for efficient patch extraction, like so -

from skimage.util.shape import view_as_windows

def get_patches(data, locations, size):
    # Get 2D sliding windows for each element of data
    w = view_as_windows(data, (1,1,size,size))
    
    # Use fancy/advanced indexing to select the required ones
    return w[np.arange(len(locations)), :, locations[:,0], locations[:,1]][:,:,0,0]

We need those 1s as the first two window dimensions passed to view_as_windows because it expects the window to have the same number of dimensions as the input data. We slide only along the last two axes of data, so the first two window dimensions are kept at 1, i.e. no sliding along those axes.
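On NumPy 1.20 or newer, the same kind of windowed view can be built without scikit-image via np.lib.stride_tricks.sliding_window_view; a minimal sketch of the idea, reusing the question's sample data -

```python
import numpy as np

n, c, h, w = 3, 1, 4, 4
data = np.arange(n * c * h * w).reshape(n, c, h, w)
size = 2
locations = np.array([[0, 1], [1, 1], [0, 2]])

# All (size x size) windows over the last two axes:
# shape (n, c, h-size+1, w-size+1, size, size)
v = np.lib.stride_tricks.sliding_window_view(data, (size, size), axis=(2, 3))

# Pick one window per sample with advanced indexing
crops = v[np.arange(n), :, locations[:, 0], locations[:, 1]]  # (n, c, size, size)
```

Note that sliding_window_view returns a read-only view, so no data is copied until the final advanced-indexing step.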

Sample runs for one-channel and multi-channel data -

In [78]: n, c, h, w = 3, 1, 4, 4 # number of channels = 1
    ...: data = np.arange(n * c * h * w).reshape(n, c, h, w)
    ...: 
    ...: size = 2
    ...: locations = np.array([
    ...:     [0, 1],
    ...:     [1, 1],
    ...:     [0, 2]
    ...: ])
    ...: 
    ...: crops = np.stack([d[:, y:y+size, x:x+size] 
    ...:      for d, (y,x) in zip(data, locations)])

In [79]: print(np.allclose(get_patches(data, locations, size), crops))
True

In [80]: n, c, h, w = 3, 5, 4, 4 # number of channels = 5
    ...: data = np.arange(n * c * h * w).reshape(n, c, h, w)
    ...: 
    ...: size = 2
    ...: locations = np.array([
    ...:     [0, 1],
    ...:     [1, 1],
    ...:     [0, 2]
    ...: ])
    ...: 
    ...: crops = np.stack([d[:, y:y+size, x:x+size] 
    ...:      for d, (y,x) in zip(data, locations)])

In [81]: print(np.allclose(get_patches(data, locations, size), crops))
True
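For completeness, the crops can also be gathered with plain advanced indexing alone, no windowed view needed, by broadcasting one index array per axis; a sketch using the question's sample data -

```python
import numpy as np

n, c, h, w = 3, 1, 4, 4
data = np.arange(n * c * h * w).reshape(n, c, h, w)
size = 2
locations = np.array([[0, 1], [1, 1], [0, 2]])

# Build broadcastable index arrays for every axis; element [i, j, a, b]
# of the result is data[i, j, y_i + a, x_i + b].
idx_n = np.arange(n)[:, None, None, None]
idx_c = np.arange(c)[None, :, None, None]
idx_y = (locations[:, 0, None] + np.arange(size))[:, None, :, None]
idx_x = (locations[:, 1, None] + np.arange(size))[:, None, None, :]

crops = data[idx_n, idx_c, idx_y, idx_x]  # shape (n, c, size, size)
```

Unlike the strided-view approach this gathers each element individually, but it stays entirely within NumPy's advanced-indexing machinery, which is what the question asked for.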

Benchmarking

Other approaches -

# Original soln
def stack(data, locations, size):
    crops = np.stack([d[:, y:y+size, x:x+size] 
         for d, (y,x) in zip(data, locations)])    
    return crops

# scholi's soln
def allocate_assign(data, locations, size):
    n, c, h, w = data.shape
    crops = np.zeros((n,c,size,size))
    for i, (y,x) in enumerate(locations):
        crops[i,0,:,:] = data[i,0,y:y+size,x:x+size]
    return crops

From the comments, it seems the OP is interested in data of shape (512, 1, 60, 60) and size values of 12, 24, and 48. So, let's set up matching data with a function -

# Setup data
def create_inputs(size):
    np.random.seed(0)
    n, c, h, w = 512, 1, 60, 60
    data = np.arange(n * c * h * w).reshape(n, c, h, w)
    locations = np.random.randint(0,3,(n,2))
    return data, locations, size
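Outside IPython, the %timeit numbers below can be approximated with the standard timeit module; a rough sketch of such a harness for the two pure-NumPy variants (the view_as_windows-based get_patches could be timed the same way when scikit-image is installed) -

```python
import timeit
import numpy as np

def stack(data, locations, size):
    # Original list-comprehension + np.stack solution
    return np.stack([d[:, y:y+size, x:x+size]
                     for d, (y, x) in zip(data, locations)])

def allocate_assign(data, locations, size):
    # Preallocate the output, then copy each crop in a loop
    n, c, h, w = data.shape
    crops = np.zeros((n, c, size, size), dtype=data.dtype)
    for i, (y, x) in enumerate(locations):
        crops[i, 0, :, :] = data[i, 0, y:y+size, x:x+size]
    return crops

def create_inputs(size):
    np.random.seed(0)
    n, c, h, w = 512, 1, 60, 60
    data = np.arange(n * c * h * w).reshape(n, c, h, w)
    locations = np.random.randint(0, 3, (n, 2))
    return data, locations, size

data, locations, size = create_inputs(size=12)
for fn in (stack, allocate_assign):
    t = timeit.timeit(lambda: fn(data, locations, size), number=100) / 100
    print(f"{fn.__name__}: {t * 1e3:.3f} ms per call")
```

Absolute numbers will of course differ from machine to machine; only the relative ordering is meaningful.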

Timings -

In [186]: data, locations, size = create_inputs(size=12)

In [187]: %timeit stack(data, locations, size)
     ...: %timeit allocate_assign(data, locations, size)
     ...: %timeit get_patches(data, locations, size)
1000 loops, best of 3: 1.26 ms per loop
1000 loops, best of 3: 1.06 ms per loop
10000 loops, best of 3: 124 µs per loop

In [188]: data, locations, size = create_inputs(size=24)

In [189]: %timeit stack(data, locations, size)
     ...: %timeit allocate_assign(data, locations, size)
     ...: %timeit get_patches(data, locations, size)
1000 loops, best of 3: 1.66 ms per loop
1000 loops, best of 3: 1.55 ms per loop
1000 loops, best of 3: 470 µs per loop

In [190]: data, locations, size = create_inputs(size=48)

In [191]: %timeit stack(data, locations, size)
     ...: %timeit allocate_assign(data, locations, size)
     ...: %timeit get_patches(data, locations, size)
100 loops, best of 3: 2.8 ms per loop
100 loops, best of 3: 3.33 ms per loop
1000 loops, best of 3: 1.45 ms per loop
Divakar
  • Thank you for the fast response! I was still reading the docu about the view_as_windows and was trying things out, as I saw your answer. I will try it out, thanks again! :) – DiKorsch Jan 10 '18 at 16:05
  • Interesting fact: everything works fine, BUT: both solutions are roughly equally fast... Additionally, I am working with cupy on a GPU, and I had to switch these cropping operations to the CPU, since view_as_windows is CPU-only. As a result, both implementations perform 3 times faster. Just wanted to share this with you. – DiKorsch Jan 10 '18 at 16:16
  • @BloodyD Both solutions as in your stack based one and this one? I think to see performance gains with this one, you would need a lot more number of windows, i.e. a large number of rows in `locations`. – Divakar Jan 10 '18 at 16:18
  • yes, both solutions are faster on CPU than on GPU (although I cannot test your solution on the GPU). My current n is 20, I could try a bigger one, like 512 or so... – DiKorsch Jan 10 '18 at 16:20
  • @BloodyD What exactly is the backend there? Don't think NumPy supports GPU natively. – Divakar Jan 10 '18 at 16:22
  • I'm using cupy (https://docs-cupy.chainer.org/en/stable/). As I mentioned in the comment on scholi's solution, your solution is the fastest when n is about 512, followed by scholi's solution and then mine with np.stack – DiKorsch Jan 10 '18 at 16:29
  • @BloodyD So, are you sure `np.lib.stride_tricks.as_strided` is supported in `cupy`? Is it making copy there? – Divakar Jan 10 '18 at 16:41
  • Cupy implements only a subset of numpy's functions, so I don't think it has the stride_tricks module implemented. Nevertheless, its ok for me to switch from cupy to numpy at this point in the code, since it is faster anyway. – DiKorsch Jan 10 '18 at 16:51
  • @BloodyD Yeah, if you are doing patch extraction, which doesn't involve any compute, stay in CPU, because the memory overhead of transferring to GPU would make it worse probably. So, if you are doing all this on CPU with `n, c, h, w = 512, 1, 4, 4`, I would think this `view_as_windows` should have performed noticeably better. – Divakar Jan 10 '18 at 16:56
  • It performs only slightly better... Instead of 12 iterations (one iteration is one 512x1x60x60 array) per second it achieves 14. – DiKorsch Jan 10 '18 at 17:02
  • @BloodyD What `size` value are you using for that setup? – Divakar Jan 10 '18 at 17:05
  • It varies from 12, 24 to 48. The size of 2 in the original post is just a toy example. – DiKorsch Jan 10 '18 at 18:28
  • @BloodyD Added timings. Seeing a speedup of `2x+` with the windows based one over others. – Divakar Jan 10 '18 at 18:38
  • Ok, I will check it tomorrow. I am away from my PC till tomorrow morning. – DiKorsch Jan 10 '18 at 18:46
  • I have profiled my program and found out that doing the cropping on the CPU makes the cropping part a non-factor compared to the rest: roughly 3 out of 116 seconds during the test were spent on cropping. @Divakar, the timings of your solution may be better, but in the whole context of my program they don't yield any improvement. Anyway, thank you for the solution and the interesting discussion (I think we should have moved the discussion to chat ¯\\_(ツ)_/¯) – DiKorsch Jan 11 '18 at 10:32

Stacking is slow. Since the size is known, it is better to preallocate the cropped array first:

crops = np.zeros((3,1,size,size))
for i, (y,x) in enumerate(locations):
    crops[i,0,:,:] = data[i,0,y:y+size,x:x+size]

Divakar's solution is by far the slowest in my tests: with %%timeit I get 92.3 µs for it, 35.4 µs for your stack solution, and 29.3 µs for the example I provide.

scholi
  • I doubt that a normal Python loop is faster than a list comprehension. Try increasing the first dimension? Or is np.stack so slow? – DiKorsch Jan 10 '18 at 16:18
  • Ok, I have tested your solution with the first dimension set to 512, and pre-allocating the array gives a small boost compared to my np.stack solution. Nevertheless, Divakar's solution is slightly faster. – DiKorsch Jan 10 '18 at 16:26