
I have a large 3D numpy array, lookup = np.random.rand(1000, 1000, 1000). It represents 1000 images of resolution (1000, 1000). For every image I'm trying to get a list of values at different locations. I have a location array, m = np.random.rand(1000, 2) * 1000; m = m.astype('int').

I'm getting the values of every slice at those locations (see the example code below):

%timeit lookup[:,m[:,1], m[:,0]]

This operation ends up being surprisingly slow. I get about 20 ms on my laptop. I would want it to be 1-2 orders of magnitude faster. Is there a better way to slice this type of numpy array?
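For reference, the full setup is:

import numpy as np

lookup = np.random.rand(1000, 1000, 1000)         # 1000 images of 1000x1000 pixels (~8 GB of float64)
m = (np.random.rand(1000, 2) * 1000).astype(int)  # 1000 random (x, y) locations
values = lookup[:, m[:, 1], m[:, 0]]              # one value per image per location, shape (1000, 1000)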

Mike Azatov
  • That's not a slice. It's a fancy index. And there's nothing much you can do about it. 20ms is nothing to worry about. Much of it is likely the overhead of python objects anyway – Mad Physicist Jun 03 '21 at 17:54
  • Good to know it's called fancy index, thanks. It's a big worry to me as this is being called a whole lot and this 20 ms becomes quite significant. I need a faster way to retrieve data. – Mike Azatov Jun 03 '21 at 18:07
  • Have you a) had an actual timing problem, and b) profiled the code and found this to be the bottleneck? – Mad Physicist Jun 03 '21 at 18:14
  • a) yes and b) also yes – Mike Azatov Jun 03 '21 at 18:16
  • I recommend moving away from python, or at least from the current formulation of your solution, then. You could start by showing more context. – Mad Physicist Jun 03 '21 at 18:35

1 Answer


This computation is not slow because of Numpy but because of your hardware, and it would be slow on most modern hardware. The main reason is that the random RAM accesses make the computation latency-bound.

The input array is big and thus cannot fit in the CPU caches, so it has to be read from RAM. Modern RAM can have quite a high throughput, but each fetch comes with a fairly high latency (about 80 ns per random fetch on recent x86 processors). While the throughput of new RAM devices tends to improve significantly over time, this is barely the case for latency. Fetching one 8-byte double-precision floating-point number at a time sequentially would result in a throughput of size_of_double/RAM_latency = 8/80e-9 ≈ 95 MiB/s. This is a small fraction of what modern mainstream RAM devices can do (dozens of GiB/s).

To mitigate this, modern processors can fetch several memory blocks at a time and try to predict RAM accesses so that data can be loaded ahead of time, using prefetching units and speculation. This works well for predictable access patterns (especially contiguous loads), but not at all with a random access pattern like the one in your code. Modern processors do manage to fetch multiple memory blocks in parallel even in sequential code, but not enough for this kind of code to be fast (the throughput of your code is about 400 MiB/s). Not to mention that mainstream x86 processors systematically load 64-byte cache lines from RAM while you only need 8 bytes of each.

One solution is to parallelize this operation. However, this is not very efficient because of the cache-line granularity (you will barely get more than 10% of the maximal throughput).
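For example, a parallel version could look like the following sketch (assuming Numba is available; parallel_gather is just an illustrative name, not code from the question):

import numba as nb
import numpy as np

@nb.njit(parallel=True)
def parallel_gather(lookup, m):
    # Gather lookup[i, m[j, 1], m[j, 0]] for every image i and point j
    out = np.empty((lookup.shape[0], m.shape[0]), dtype=lookup.dtype)
    for i in nb.prange(lookup.shape[0]):
        for j in range(m.shape[0]):
            out[i, j] = lookup[i, m[j, 1], m[j, 0]]
    return out

Each thread still performs random fetches, so the per-access latency and the 64-byte cache-line granularity remain the limiting factors.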

An alternative solution is to transpose your input data so that the fetched memory blocks are contiguous. Here is an example:

transposedLookup = np.array(lookup.T)  # eager copy, stored in (x, y, image) order
%timeit transposedLookup[m[:,0], m[:,1], :].T

Note that the first transposition will be rather slow (mainly because it is not yet optimized by Numpy, but also because of the access pattern), and it requires twice the amount of RAM. You can use this solution to speed up the transposition. It is also possible to transpose the input array in place if it is cubic. It would be even better if you could generate the array directly in its transposed form.
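For instance, if the images are produced or loaded one by one, the array can be filled directly in its transposed (x, y, image) layout (a sketch; the random image just stands in for whatever produces each image):

n_images, height, width = 1000, 1000, 1000
transposedLookup = np.empty((width, height, n_images))

for k in range(n_images):
    image = np.random.rand(height, width)   # placeholder for loading/computing image k
    transposedLookup[:, :, k] = image.T     # store it directly in (x, y, image) order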

Also note that the second transposition is fast because it is done lazily, but even an eager transposition is still several times faster than the original code.
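The lazy and eager variants correspond roughly to the following (np.ascontiguousarray is just one way to force an actual copy):

lazy = transposedLookup[m[:, 0], m[:, 1], :].T                         # the trailing .T only creates a view
eager = np.ascontiguousarray(transposedLookup[m[:, 0], m[:, 1], :].T)  # materializes the transposed copy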

Here are the timings on my machine:

Original code: 14.1 ms
Eager Numpy transposition: 2.6 ms
Lazy Numpy transposition: 0.6 ms

EDIT: note that the same thing applies to m.

Jérôme Richard
  • Wow, this is really impressive! The first transpose is not an issue as I can create lookup the correct way, and the second transpose is not an issue as I'll be taking np.min(axis=0) of the output and can just switch the axis. I already got a ~5-10x improvement. Is there a way I can structure `m` to make this query even faster? – Mike Azatov Jun 03 '21 at 20:49
  • Sorting the lookup indices might help – Mad Physicist Jun 03 '21 at 20:57
  • @MikeAzatov I thought transposing `m` would be faster, but the difference is not significant as `m` is in cache (if it is much bigger in practice, it may be interesting). I also tried sorting `m`; it is slightly faster (10%) but may make the code a bit less clear/maintainable. I do not see any other possible improvement related to `m`. – Jérôme Richard Jun 03 '21 at 22:21
  • Thanks, @JérômeRichard, what do you think is the best way to sort `m` since it has 2 indices? – Mike Azatov Jun 04 '21 at 01:38
  • @MikeAzatov You should sort `m` according to the values of m[:,0], but also reorder transposedLookup so that the transformation is the same for both. I am not sure it is worth the effort. – Jérôme Richard Jun 06 '21 at 09:40
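One way to apply the sorting idea (an illustrative sketch, not code from the answer; it sorts the points by their first index and restores the original point order at the end):

order = np.lexsort((m[:, 1], m[:, 0]))   # sort by m[:, 0], then by m[:, 1]
m_sorted = m[order]

values = transposedLookup[m_sorted[:, 0], m_sorted[:, 1], :].T
values = values[:, np.argsort(order)]    # restore the original point order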