
It looks like indexing numpy record arrays with an array of indices is outrageously slow. However, the same operation can be performed 10-15 times faster by first viewing the array as plain bytes with ndarray.view.

Is there a reason behind this difference? Why isn't indexing of record arrays implemented in a faster way? (see also sorting numpy structured and record arrays is very slow)

import numpy as np

mydtype = np.dtype("i4,i8")
mydtype.names = ("foo", "bar")
N = 100000

foobar = np.zeros(N, dtype=mydtype)
foobar["foo"] = np.random.randint(0, 100, N)
foobar["bar"] = np.random.randint(0, 10000, N)

b = np.lexsort((foobar["foo"], foobar["bar"]))

%timeit foobar[b]
100 loops, best of 3: 11.2 ms per loop

%timeit foobar.view("|S12")[b].view(mydtype)
1000 loops, best of 3: 882 µs per loop

Obviously, both approaches give the same result.
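That equivalence is easy to check directly; a self-contained sketch repeating the setup above and comparing the two results field by field:

```python
import numpy as np

# same setup as in the question
mydtype = np.dtype("i4,i8")
mydtype.names = ("foo", "bar")
N = 100000
foobar = np.zeros(N, dtype=mydtype)
foobar["foo"] = np.random.randint(0, 100, N)
foobar["bar"] = np.random.randint(0, 10000, N)
b = np.lexsort((foobar["foo"], foobar["bar"]))

direct = foobar[b]                               # plain fancy indexing
via_view = foobar.view("|S12")[b].view(mydtype)  # byte-view round-trip
```

The "|S12" view works here because the dtype is packed and its itemsize is exactly 4 + 8 = 12 bytes; a dtype with padding would need a different byte width.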

Maxim Imakaev
  • Maybe it is for the reason that lexsort sorts the array while view just creates a view!? I think this question could also be asked in http://codereview.stackexchange.com/ ! – jkalden Dec 27 '14 at 13:45
  • Raising this issue on the numpy github might be more productive. Those are the people who know their way around the numpy source code. – hpaulj Dec 28 '14 at 07:55

1 Answer


take, as mentioned in https://stackoverflow.com/a/23303357/901925, is even faster than your double-view approach:

np.take(foobar,b)

In fact it's as fast as

foobar['foo'][b]
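A minimal benchmark sketch comparing the three approaches, using the standard-library timeit module instead of the IPython magic (absolute timings, and whether the gap still appears, depend on your NumPy version and machine):

```python
import timeit
import numpy as np

# same setup as in the question
mydtype = np.dtype("i4,i8")
mydtype.names = ("foo", "bar")
N = 100000
foobar = np.zeros(N, dtype=mydtype)
foobar["foo"] = np.random.randint(0, 100, N)
foobar["bar"] = np.random.randint(0, 10000, N)
b = np.lexsort((foobar["foo"], foobar["bar"]))

# time each variant
t_fancy = timeit.timeit(lambda: foobar[b], number=50)
t_view = timeit.timeit(lambda: foobar.view("|S12")[b].view(mydtype), number=50)
t_take = timeit.timeit(lambda: np.take(foobar, b), number=50)

print(f"fancy: {t_fancy:.4f}s  view: {t_view:.4f}s  take: {t_take:.4f}s")
```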

https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/item_selection.c is a starting point if you want to dig further into the source code.

My guess is that something in how __getitem__ is implemented causes this difference. Perhaps, as a remnant of earlier record processing, it takes a different path when the dtype is compound (and the indexing is advanced).

Boolean mask indexing doesn't seem to be affected by this slowdown. The same goes for basic sliced indexing.
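For reference, a quick sketch of the boolean-mask and slice variants on the same structured array (both return the expected records; only the fancy-index path showed the slowdown):

```python
import numpy as np

# same setup as in the question
mydtype = np.dtype("i4,i8")
mydtype.names = ("foo", "bar")
N = 100000
foobar = np.zeros(N, dtype=mydtype)
foobar["foo"] = np.random.randint(0, 100, N)
foobar["bar"] = np.random.randint(0, 10000, N)

mask = foobar["foo"] < 50   # boolean mask indexing
subset = foobar[mask]

sliced = foobar[::2]        # basic sliced indexing (a view, no copy)
```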

hpaulj