Using numpy.take for faster fancy indexing

Question

EDIT I have kept the more complicated problem I am facing below, but my problems with np.take can be summarized better as follows. Say you have an array img of shape (planes, rows), and another array lut of shape (planes, 256), and you want to use them to create a new array out of shape (planes, rows), where out[p,j] = lut[p, img[p, j]]. This can be achieved with fancy indexing as follows:

In [4]: %timeit lut[np.arange(planes).reshape(-1, 1), img]
1000 loops, best of 3: 471 us per loop

But if, instead of fancy indexing, you use take and a Python loop over the planes, things speed up considerably (roughly 8x in this case):

In [6]: %timeit for _ in (lut[j].take(img[j]) for j in xrange(planes)) : pass
10000 loops, best of 3: 59 us per loop

Can lut and img be rearranged in some way so that the whole operation happens without Python loops, but using numpy.take (or an alternative method) instead of conventional fancy indexing, to keep the speed advantage?
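For anyone who wants to reproduce the comparison, here is a minimal sketch of the two approaches side by side. The sizes and the seeded generator are assumptions chosen for illustration, and the per-plane results are collected with np.stack only so that they can be compared against the fancy-indexing result (that extra copy is not part of the timed loop above).

```python
import numpy as np

# Hypothetical small sizes, chosen only for illustration.
rng = np.random.default_rng(0)
planes, rows = 3, 1000
lut = rng.integers(0, 256, size=(planes, 256), dtype=np.uint8)
img = rng.integers(0, 256, size=(planes, rows), dtype=np.uint8)

# Fancy indexing: out[p, j] = lut[p, img[p, j]].
out_fancy = lut[np.arange(planes).reshape(-1, 1), img]

# Per-plane take in a Python loop (range replaces Python 2's xrange).
out_take = np.stack([lut[p].take(img[p]) for p in range(planes)])

print(out_fancy.shape, np.array_equal(out_fancy, out_take))
```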

ORIGINAL QUESTION I have a set of look-up tables (LUTs) that I want to use on an image. The array holding the LUTs is of shape (planes, 256, n), and the image has shape (planes, rows, cols). Both are of dtype = 'uint8', matching the 256 axis of the LUT. The idea is to run the p-th plane of the image through each of the n LUTs from the p-th plane of the LUT.

If my lut and img are the following:

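import numpy as np

planes, rows, cols, n = 3, 4000, 4000, 4
# dtype=np.int32 makes each random integer exactly 4 bytes, so the
# uint8 view has the expected size on every platform.
lut = np.random.randint(-2**31, 2**31 - 1,
                        size=(planes * 256 * n // 4,),
                        dtype=np.int32).view('uint8')
lut = lut.reshape(planes, 256, n)
img = np.random.randint(-2**31, 2**31 - 1,
                        size=(planes * rows * cols // 4,),
                        dtype=np.int32).view('uint8')
img = img.reshape(planes, rows, cols)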

I can achieve what I am after using fancy indexing like this

out = lut[np.arange(planes).reshape(-1, 1, 1), img]

which gives me an array of shape (planes, rows, cols, n), where out[i, :, :, j] holds the i-th plane of img run through the j-th LUT of the i-th plane of the LUT...
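As a sanity check on that description, the following sketch verifies the output shape and one (i, j) slice on a deliberately tiny example. The sizes, the seeded generator, and the chosen i and j are assumptions for illustration only.

```python
import numpy as np

# Hypothetical tiny sizes so the check runs instantly.
rng = np.random.default_rng(0)
planes, rows, cols, n = 3, 5, 6, 4
lut = rng.integers(0, 256, size=(planes, 256, n), dtype=np.uint8)
img = rng.integers(0, 256, size=(planes, rows, cols), dtype=np.uint8)

out = lut[np.arange(planes).reshape(-1, 1, 1), img]
print(out.shape)

# out[i, :, :, j] should be plane i of img run through LUT j of plane i.
i, j = 1, 2
expected = lut[i, :, j][img[i]]
print(np.array_equal(out[i, :, :, j], expected))
```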

All is good, except for this:

In [2]: %timeit lut[np.arange(planes).reshape(-1, 1, 1), img]
1 loops, best of 3: 5.65 s per loop

which is completely unacceptable, especially since I have all of the following not-so-nice-looking alternatives using np.take that run much faster:

A single LUT on a single plane runs about x70 faster:

In [2]: %timeit np.take(lut[0, :, 0], img[0])
10 loops, best of 3: 78.5 ms per loop

A python loop running through all the desired combinations finishes almost x6 faster:

In [2]: %timeit for _ in (np.take(lut[j, :, k], img[j]) for j in xrange(planes) for k in xrange(n)) : pass
1 loops, best of 3: 947 ms per loop

Even running all combinations of planes in the LUT and image and then discarding the planes**2 - planes unwanted ones is faster than fancy indexing:
```
In [2]: %timeit np.take(lut, img, axis=1)[np.arange(planes), np.arange(planes)]
1 loops, best of 3: 3.79 s per loop
```
And the fastest combination I have been able to come up with has a python loop iterating over the planes and finishes x13 faster:
```
In [2]: %timeit for _ in (np.take(lut[j], img[j], axis=0) for j in xrange(planes)) : pass
1 loops, best of 3: 434 ms per loop
```
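The timed loop above discards its results. A sketch of how the same per-plane loop could fill a real output array, by passing slices of a preallocated array through the out argument of np.take, might look like this. The small sizes and the seeded generator are assumptions for illustration.

```python
import numpy as np

# Hypothetical small sizes; the question uses 4000 x 4000 images.
rng = np.random.default_rng(0)
planes, rows, cols, n = 3, 40, 50, 4
lut = rng.integers(0, 256, size=(planes, 256, n), dtype=np.uint8)
img = rng.integers(0, 256, size=(planes, rows, cols), dtype=np.uint8)

# Preallocate the result and let each take write into its own plane.
out = np.empty((planes, rows, cols, n), dtype=lut.dtype)
for p in range(planes):
    np.take(lut[p], img[p], axis=0, out=out[p])

# Compare against the fancy-indexing result.
ref = lut[np.arange(planes).reshape(-1, 1, 1), img]
print(np.array_equal(out, ref))
```

Each out[p] is a C-contiguous slice of shape (rows, cols, n), which matches what np.take returns for one plane, so no extra concatenation step is needed.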

The question, of course, is whether there is a way of doing this with np.take without any Python loop. Ideally, whatever reshaping or resizing is needed should happen on the LUT, not the image, but I am open to whatever you people can come up with...

what is `bkpt` in your snippet -- no need to explain, I just wanted to alert you in case its a typo -- I guess it should be `lut`? — tzelleke, Jan 24 '13 at 00:06
... shouldn't this whole line read `lut = lut.reshape(planes, 256, 4)`, so `4` in the last dim? — tzelleke, Jan 24 '13 at 00:15
@TheodrosZelleke Thanks for catching those! My `lut` is actually a **breakpoint table**, so in my code it is called `bkpt`, and that one I missed when *translating* it for the question. — Jaime, Jan 24 '13 at 00:37
really, don't know. It seems all pretty ugly. One thing, `np.take` is currently only fast if both inputs are c-contiguous (otherwise it copies them). You could translate the 2-d array to a 1-d array by hand probably, but if `img` is really large it probably doesn't matter, and if its worth the juggling around... — seberg, Jan 24 '13 at 10:38
hi, you should give a fully working example, if too long then just create a gist on github. otherwise it is difficult for people to reproduce your problem and try to help. — Andrea Zonca, Apr 20 '13 at 04:55
? (referring to your simpler edit): your two examples are not equivalent. The output of the second one will not be a numpy array `out` like you require. Can you timeit an equivalent example? It will probably include a call to `np.concatenate`, and I imagine the speed advantage might become much smaller. — Juan, May 09 '13 at 07:45
@Juan You can give `np.take` an `out` argument, so I actually preallocate the `out` array and pass slices of it to the calls to `np.take`. — Jaime, Jul 14 '13 at 13:17

Saullo G. P. Castro · Answer 1 · 2013-05-09T11:11:17.100

First of all, I have to say I really liked your question. Without rearranging LUT or IMG, the following solution worked:

%timeit a=np.take(lut, img, axis=1)
# 1 loops, best of 3: 1.93s per loop

But from the result you have to query the diagonal, a[0,0], a[1,1], a[2,2], to get what you want. I've tried to find a way to do this indexing only for the diagonal elements, but have not managed to yet.
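To make the diagonal query concrete, here is a small sketch. It assumes tiny, hypothetical sizes and a seeded generator, and it computes all planes**2 combinations before keeping only the matching ones, so it illustrates the indexing rather than offering a speed-up.

```python
import numpy as np

# Hypothetical tiny sizes for illustration.
rng = np.random.default_rng(0)
planes, rows, cols, n = 3, 5, 6, 4
lut = rng.integers(0, 256, size=(planes, 256, n), dtype=np.uint8)
img = rng.integers(0, 256, size=(planes, rows, cols), dtype=np.uint8)

# a[p, q] holds plane q of img run through the LUTs of plane p.
a = np.take(lut, img, axis=1)
print(a.shape)

# Keep only the matching pairs a[0, 0], a[1, 1], a[2, 2].
diag = a[np.arange(planes), np.arange(planes)]

ref = lut[np.arange(planes).reshape(-1, 1, 1), img]
print(diag.shape, np.array_equal(diag, ref))
```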

Here are some ways to rearrange your LUT and IMG. The following works if the indexes in IMG run from 0-255 for the 1st plane, 256-511 for the 2nd plane, and 512-767 for the 3rd plane, but that would prevent you from using 'uint8', which can be a big issue:

lut2 = lut.reshape(-1,4)
%timeit np.take(lut2,img,axis=0)
# 1 loops, best of 3: 716 ms per loop
# or
%timeit np.take(lut2, img.flatten(), axis=0).reshape(3,4000,4000,4)
# 1 loops, best of 3: 709 ms per loop
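A sketch of how the offset indexes described above could be built from a uint8 image follows. The sizes, the seeded generator, and the names offsets and idx are assumptions for illustration. Note that idx is a widened copy of the image, so this costs extra memory and conversion time that the timings above do not include.

```python
import numpy as np

# Hypothetical tiny sizes for illustration.
rng = np.random.default_rng(0)
planes, rows, cols, n = 3, 5, 6, 4
lut = rng.integers(0, 256, size=(planes, 256, n), dtype=np.uint8)
img = rng.integers(0, 256, size=(planes, rows, cols), dtype=np.uint8)

# Stack the per-plane tables: row 256 * p + v is the entry for value v
# in plane p.
lut2 = lut.reshape(-1, n)

# Shift each plane's values into its own block of 256 rows. The image
# is widened first because uint8 cannot hold indexes above 255.
offsets = 256 * np.arange(planes).reshape(-1, 1, 1)
idx = img.astype(np.intp) + offsets

out = np.take(lut2, idx, axis=0)

ref = lut[np.arange(planes).reshape(-1, 1, 1), img]
print(out.shape, np.array_equal(out, ref))
```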

On my machine, your solution is still the best option, and very adequate since you just need the diagonal evaluations, i.e. plane1-plane1, plane2-plane2 and plane3-plane3:

%timeit for _ in (np.take(lut[j], img[j], axis=0) for j in xrange(planes)) : pass
# 1 loops, best of 3: 677 ms per loop

I hope this gives you some insight toward a better solution. It would be worth looking into more options with flatten(), and similar methods such as np.apply_over_axes() or np.apply_along_axis(), which seem promising.

I used the code below to generate the data:

import numpy as np
num = 4000
planes, rows, cols, n = 3, num, num, 4
# dtype=np.int32 keeps each random integer at 4 bytes on every platform
lut = np.random.randint(-2**31, 2**31-1, size=(planes*256*n//4,), dtype=np.int32).view('uint8')
lut = lut.reshape(planes, 256, n)
img = np.random.randint(-2**31, 2**31-1, size=(planes*rows*cols//4,), dtype=np.int32).view('uint8')
img = img.reshape(planes, rows, cols)
