There seems to be common wisdom that using np.take
is considerably faster than array indexing. For example http://wesmckinney.com/blog/numpy-indexing-peculiarities/, Fast numpy fancy indexing, and Fast(er) numpy fancy indexing and reduction?. There are also suggestions that np.ix_
is better in some cases.
I've done some profiling, and that does seem to be true in the majority of cases, although the difference decreases as the arrays become larger.
Performance is impacted by the size of the array, the length of the index (for rows), and the number of columns taken. The number of rows seems to have the largest effect, the number of columns in the array also has a an effect even when the index is 1D. Changing the size of the index doesn't seem to affect things much between the methods.
So, the question is two fold: 1. Why is there such a large difference in performance between methods? 2. When does it make sense to use one method over another? Are there some array types, orderings, or shapes that one will always work better for?
There are a lot of things that could be impacting performance, so I've shown a few of them below, and included the code used to try and make this reproducible.
Edit I've updated the y axis on the plots to show the complete range of values. It makes it clearer that the difference is smaller than it appeared for 1D data.
1D index
Looking at running time compared to the number of rows shows that indexing is pretty consistent, with a slight upward trend. take
is consistently slower as the number rows increases.
As the number of columns increases both become slower, but take
has a higher increase (and this is for a 1D index still).
2D index
With 2D data results are similar. Using ix_
is also shown, and it seems to have the worst performance overall.
Code for figures
from pylab import *
import timeit
def get_test(M, T, C):
"""
Returns an array and random sorted index into rows
M : number of rows
T : rows to take
C : number of columns
"""
arr = randn(M, C)
idx = sort(randint(0, M, T))
return arr, idx
def draw_time(call, N=10, V='M', T=1000, M=5000, C=300, **kwargs):
"""
call : function to do indexing, accepts (arr, idx)
N : number of times to run timeit
V : string indicating to evaluate number of rows (M) or rows taken (T), or columns created(C)
** kwargs : passed to plot
"""
pts = {
'M': [10, 20, 50, 100, 500, 1000, 2000, 5000, 10000, 20000, 50000, 100000, 200000, 500000, ],
'T': [10, 50, 100, 500, 1000, 5000, 10000, 50000],
'C': [5, 10, 20, 50, 100, 200, 500, 1000],
}
res = []
kw = dict(T=T, M=M, C=C) ## Default values
for v in pts[V]:
kw[V] = v
try:
arr, idx = get_test(**kw)
except CallerError:
res.append(None)
else:
res.append(timeit.timeit(lambda :call(arr, idx), number=N))
plot(pts[V], res, marker='x', **kwargs)
xscale('log')
ylabel('runtime [s]')
if V == 'M':
xlabel('size of array [rows]')
elif V == 'T':
xlabel('number of rows taken')
elif V == 'C':
xlabel('number of columns created')
funcs1D = {
'fancy':lambda arr, idx: arr[idx],
'take':lambda arr, idx: arr.take(idx, axis=0),
}
cidx = r_[1, 3, 7, 15, 29]
funcs2D = {
'fancy2D':lambda arr, idx: arr[idx.reshape(-1, 1), cidx],
'take2D':lambda arr, idx: arr.take(idx.reshape(-1, 1)*arr.shape[1] + cidx),
'ix_':lambda arr, idx: arr[ix_(idx, cidx)],
}
def test(funcs, N=100, **kwargs):
for descr, f in funcs.items():
draw_time(f, label="{}".format(descr), N=100, **kwargs)
legend()
figure()
title('1D index, 30 columns in data')
test(funcs1D, V='M')
ylim(0, 0.25)
# savefig('perf_1D_arraysize', C=30)
figure()
title('1D index, 5000 rows in data')
test(funcs1D, V='C', M=5000)
ylim(0, 0.07)
# savefig('perf_1D_numbercolumns')
figure()
title('2D index, 300 columns in data')
test(funcs2D, V='M')
ylim(0, 0.01)
# savefig('perf_2D_arraysize')
figure()
title('2D index, 30 columns in data')
test(funcs2D, V='M')
ylim(0, 0.01)
# savefig('perf_2D_arraysize_C30', C=30)