
I have a very large numpy array of shape (256, 256, 256, 256) which takes up about 8GB of RAM.

I want to process the data quickly using multiprocessing, similar to the method used by tile-based rendering software such as Blender.

I would like to split my data into chunks of smaller size, create a list of processes for them and when one finishes, start working on the next. My current method uses hsplit, whose result is then looped over to apply vsplit and get the chunks. Here is an example of the code for a much smaller data set (9 by 9, divided into 3 by 3 chunks):

import numpy as np

# 9-by-9 test array standing in for the real data
data = np.arange(81).reshape(9, 9)

print(data.shape)
print(data)
print("#" * 30)

# hsplit gives 3 column blocks of shape (9, 3); vsplit then cuts each into 3-by-3 chunks
data = np.array([np.vsplit(cols, 3) for cols in np.hsplit(data, 3)])

print(data.shape)
print(data)
print("#" * 30)

This doesn't quite do what I want it to, which is creating 9 chunks of 3 by 3, but that's a minor issue. The major issue is that I don't think looping through an array like that to apply vsplit is very efficient, and I'm not aware of a built-in function that does this automatically without looping.

I tried using map to see if it automatically applies vsplit in a more efficient way, but the timing results were very similar, so I don't think that's the case. Any suggestions on how to optimize this problem?

OM222O
  • You may want to have a look at [numexpr](https://github.com/pydata/numexpr) which AFAIK implements something very much like your strategy. – Paul Panzer Nov 15 '19 at 03:30
  • @Paul Panzer I still can't see how it helps with dividing the array into smaller chunks. That's where most of my performance increase is coming from. Being able to fully utilize all CPU cores is a lot more important than a small improvement to the processing algorithm itself. – OM222O Nov 15 '19 at 04:05
  • _"NumExpr parses expressions into its own op-codes that are then used by an integrated computing virtual machine. The array operands are split into small chunks that easily fit in the cache of the CPU and passed to the virtual machine. The virtual machine then applies the operations on each chunk. It's worth noting that all temporaries and constants in the expression are also chunked. Chunks are distributed among the available cores of the CPU, resulting in highly parallelized code execution."_ That not similar to what you are trying? – Paul Panzer Nov 15 '19 at 04:48

2 Answers


You want something as_strided-based. I'd recommend my recipe here. Using it you can do:

patches = window_nd(data, window = 3, steps = 3)
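
The recipe itself isn't reproduced here, but as a rough sketch of the same idea for the 2-D case (block_view below is a hypothetical stand-in for window_nd, assuming a 2-D array and non-overlapping blocks):

import numpy as np
from numpy.lib.stride_tricks import as_strided

def block_view(a, window, steps):
    # how many windows fit along each of the two axes
    nrows = (a.shape[0] - window) // steps + 1
    ncols = (a.shape[1] - window) // steps + 1
    shape = (nrows, ncols, window, window)
    strides = (a.strides[0] * steps, a.strides[1] * steps,
               a.strides[0], a.strides[1])
    # a read-only view into the original buffer, no data is copied
    return as_strided(a, shape=shape, strides=strides, writeable=False)

The actual window_nd recipe generalizes this idea to arbitrary dimensions and window/step combinations.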

To pass to multiprocessing, you'll need a generator. Something like:

def patch_gen(data, window, **kwargs):
    # number of dimensions of the input array
    dims = len(data.shape)
    # strided view: the leading axes index the patch grid, the trailing axes are the window
    patches = window_nd(data, window, **kwargs)
    # collapse the grid axes so the patches can be iterated as one flat sequence
    patches = patches.reshape((-1,) + patches.shape[-dims:])
    for i in range(patches.shape[0]):
        yield patches[i]
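
For the 9-by-9 example from the question, this should yield nine 3-by-3 patches (a small sanity check, assuming window_nd behaves as described above):

data = np.arange(81).reshape(9, 9)
for patch in patch_gen(data, window=3, steps=3):
    print(patch.shape)  # (3, 3), printed nine times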

You can also use view_as_blocks from @NilsWerner's answer below:

import skimage.util

def patch_gen(data, window_shape):
    dims = len(data.shape)
    # non-overlapping block view: the leading axes index the block grid
    patches = skimage.util.view_as_blocks(data, window_shape)
    patches = patches.reshape((-1,) + patches.shape[-dims:])
    for i in range(patches.shape[0]):
        yield patches[i]

Then you can do something like:

with Pool(processors) as p:
    out = p.map(func, patch_gen(data, window = 3, steps = 3))
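
A more complete sketch of that pipeline, reusing the patch_gen generator defined above (func, processors and the tiny test array are illustrative assumptions, not part of the original setup):

from multiprocessing import Pool

import numpy as np

def func(patch):
    # placeholder worker: reduce each chunk to a single number
    return patch.sum()

if __name__ == "__main__":
    data = np.arange(81).reshape(9, 9)
    processors = 4  # number of worker processes
    with Pool(processors) as p:
        out = p.map(func, patch_gen(data, window=3, steps=3))
    print(out)  # one result per 3-by-3 chunk, nine in total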

This method should only use the original memory of the array and not make a copy at any point (I think). That way you don't waste time and memory making copies of the original data; the chunks are just views into the original array.

Caution: don't write to the patches! At least, in general it's a bad idea to write to as_strided results. If you don't have overlapping windows you might be ok, but as soon as your windows overlap (i.e. steps < window) you'll have a mess on your hands.
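
A quick way to see the hazard (assuming window_nd accepts steps smaller than window):

import numpy as np

data = np.arange(81).reshape(9, 9)
overlapping = window_nd(data, window=3, steps=1)
# neighbouring windows share memory, so writing to one silently changes the others
print(np.shares_memory(overlapping[0, 0], overlapping[0, 1]))  # True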

Daniel F
  • You can also output a list of windows from `window_nd` using `outlist = True`, but I'm thinking of changing that to a generator anyway since it can end up throwing `MemoryError`s for larger use cases. – Daniel F Nov 15 '19 at 08:17
  • Thanks, this is exactly what I was looking for. However, I'm not sure how to stack the data back to the original shape without copying it; the final output would be a list of all the processed chunks. Here is an example I wrote: https://repl.it/repls/PlaintiveJointDrawing – OM222O Nov 15 '19 at 11:39
  • @OM222O are you looking to change the data in-place or create a new output? – Daniel F Nov 18 '19 at 06:42
  • Ideally joining them in place since, like I mentioned in the beginning, it's at least 8GB of data and I only have 16GB right now. With Windows running it gets up to 12 or 13GB usage, which doesn't leave enough space to make a new copy. I just want the data to be broken into chunks in order to process it quickly and then put back together after processing. I know in C I can use pointers to different parts of the array to start the operation on, which doesn't use any additional memory and the data is not actually segmented, but I'm not sure how to implement that in Python. – OM222O Nov 18 '19 at 08:05
  • If you're absolutely sure you're always going to have non-intersecting blocks, you should be able to write back to the resulting `patches` in-place. Problems will happen if your `patches` overlap (you'll get race conditions and some other weird things I'm not enough of a CS guy to untangle). Use with caution! – Daniel F Nov 18 '19 at 08:42
  • Or you could use @NilsWerner 's answer below and build the generator for your `p.map()` as I did. `view_as_blocks` has more built-in guard rails than my recipe and is less likely to trip you up later. – Daniel F Nov 18 '19 at 08:46
  • There will never be any intersecting blocks, since it doesn't even make sense to process an area twice and the rest only once. Each instruction (applying filters, averaging, summing, etc.) must be applied to the data uniformly, which requires non-intersecting blocks. The skimage solution is simple, but the library itself has about 12 to 15 different prerequisites, most of which I've never heard of, so I wasn't sure how good of an idea that is. I will try to look into patches to see if I can find a solution that way. – OM222O Nov 18 '19 at 09:36
  • @OM222O Even if you don't want to use intersecting blocks, I want to make it clear to anyone else that comes upon this answer. There are many use cases for `as_strided` that involve convolution on overlapping regions and that is right out for this implementation. – Daniel F Nov 18 '19 at 10:01

You can use scikit-image's skimage.util.view_as_blocks(), which solves your blocking problem without moving or copying any of the data.

skimage.util.view_as_blocks(data, (3, 3))
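
For the 9-by-9 example from the question, a short usage sketch (the example array is an assumption; the call itself is scikit-image's documented API):

import numpy as np
import skimage.util

data = np.arange(81).reshape(9, 9)
blocks = skimage.util.view_as_blocks(data, (3, 3))
print(blocks.shape)   # (3, 3, 3, 3): a 3-by-3 grid of 3-by-3 blocks
print(blocks[0, 0])   # the top-left 3-by-3 block, a view into data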
Nils Werner