Why is `arr.take(idx)` faster than `arr[idx]`

Question

There seems to be common wisdom that using np.take is considerably faster than array indexing. For example http://wesmckinney.com/blog/numpy-indexing-peculiarities/, Fast numpy fancy indexing, and Fast(er) numpy fancy indexing and reduction?. There are also suggestions that np.ix_ is better in some cases.

I've done some profiling, and that does seem to be true in the majority of cases, although the difference decreases as the arrays become larger.
Performance is impacted by the size of the array, the length of the index (for rows), and the number of columns taken. The number of rows seems to have the largest effect, the number of columns in the array also has a an effect even when the index is 1D. Changing the size of the index doesn't seem to affect things much between the methods.

So, the question is two fold: 1. Why is there such a large difference in performance between methods? 2. When does it make sense to use one method over another? Are there some array types, orderings, or shapes that one will always work better for?

There are a lot of things that could be impacting performance, so I've shown a few of them below, and included the code used to try and make this reproducible.

Edit I've updated the y axis on the plots to show the complete range of values. It makes it clearer that the difference is smaller than it appeared for 1D data.

1D index

Looking at running time compared to the number of rows shows that indexing is pretty consistent, with a slight upward trend. take is consistently slower as the number rows increases.

As the number of columns increases both become slower, but take has a higher increase (and this is for a 1D index still).

2D index

With 2D data results are similar. Using ix_ is also shown, and it seems to have the worst performance overall.

Code for figures

from pylab import *
import timeit


def get_test(M, T, C):
    """
    Returns an array and random sorted index into rows
    M : number of rows
    T : rows to take
    C : number of columns
    """
    arr = randn(M, C)
    idx = sort(randint(0, M, T))
    return arr, idx


def draw_time(call, N=10, V='M', T=1000, M=5000, C=300, **kwargs):
    """
    call : function to do indexing, accepts (arr, idx)
    N : number of times to run timeit
    V : string indicating to evaluate number of rows (M) or rows taken (T), or columns created(C)
    ** kwargs : passed to plot
    """
    pts = {
        'M': [10, 20, 50, 100, 500, 1000, 2000, 5000, 10000, 20000, 50000, 100000, 200000, 500000, ],
        'T': [10, 50, 100, 500, 1000, 5000, 10000, 50000],
        'C': [5, 10, 20, 50, 100, 200, 500, 1000],
    }
    res = []

    kw = dict(T=T, M=M, C=C) ## Default values
    for v in pts[V]:
        kw[V] = v
        try:
            arr, idx = get_test(**kw)
        except CallerError:
            res.append(None)
        else:
            res.append(timeit.timeit(lambda :call(arr, idx), number=N))

    plot(pts[V], res, marker='x', **kwargs)
    xscale('log')
    ylabel('runtime [s]')

    if V == 'M':
        xlabel('size of array [rows]')
    elif V == 'T':
        xlabel('number of rows taken')
    elif V == 'C':
        xlabel('number of columns created')

funcs1D = {
    'fancy':lambda arr, idx: arr[idx],
    'take':lambda arr, idx: arr.take(idx, axis=0),
}

cidx = r_[1, 3, 7, 15, 29]
funcs2D = {
    'fancy2D':lambda arr, idx: arr[idx.reshape(-1, 1), cidx],
    'take2D':lambda arr, idx: arr.take(idx.reshape(-1, 1)*arr.shape[1] + cidx),
    'ix_':lambda arr, idx: arr[ix_(idx, cidx)],
}

def test(funcs, N=100, **kwargs):
    for descr, f in funcs.items():
        draw_time(f, label="{}".format(descr), N=100, **kwargs)
    legend()

figure()
title('1D index, 30 columns in data')
test(funcs1D, V='M')
ylim(0, 0.25)
# savefig('perf_1D_arraysize', C=30)

figure()
title('1D index, 5000 rows in data')
test(funcs1D, V='C', M=5000)
ylim(0, 0.07)
# savefig('perf_1D_numbercolumns')

figure()
title('2D index, 300 columns in data')
test(funcs2D, V='M')
ylim(0, 0.01)
# savefig('perf_2D_arraysize')

figure()
title('2D index, 30 columns in data')
test(funcs2D, V='M')
ylim(0, 0.01)
# savefig('perf_2D_arraysize_C30', C=30)

`take` may be faster, but judging from what I've seen in SO questions (and in `numpy` functions) it isn't used nearly as often as 'plain' indexing. — hpaulj, Mar 12 '19 at 17:20

M1L0U · Answer 1 · 2019-04-05T03:07:14.717

The answer is very low level, and have to do with the C compiler and CPU cache optimizations. Please see the active discussion with Sebastian Berg and Max Bolingbroke (both numpy's contributors) on this numpy issue.

Fancy indexing tries to be "smart" about how the memory is read and written (C-order vs F-order), while .take will always keep C-order. This means fancy indexing will usually be much faster for F-ordered arrays, and should always be faster in any case for huge arrays. Now, numpy decides what is the "smart" way without taking the size of the array into consideration, or the particular hardware it is running on. Therefore, for smaller arrays, choosing the "wrong" memory order might actually get better performance thanks to better use of reads in CPU-cache.

Thanks for the link, I'll have to spend some time reading over that. From a link there it seems that the idea `take` is better originated with optimizations for pandas, which makes me feel better ignoring it as a not-always-correct bit of folk wisdom. — user2699, Apr 05 '19 at 12:53

Why is `arr.take(idx)` faster than `arr[idx]`

1D index

2D index

Code for figures

1 Answers1