When you do x[selected_rows, :] where selected_rows is an array, it performs advanced (aka fancy) indexing to create a new array. This is what takes time.
If you use a slice instead, NumPy creates a view of the original array without copying any data, which takes much less time. For example:
import timeit
import numpy as np
selected_rows = np.arange(0, 100000, 2)
array = np.random.random((100000, 500))
t1 = timeit.timeit("array[selected_rows, :].mean(axis=0)", globals=globals(), number=10)
t2 = timeit.timeit("array[::2, :].mean(axis=0)", globals=globals(), number=10)
print(t1, t2, t1 / t2) # 1.3985465039731935 0.18735826201736927 7.464557414839488
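The copy-versus-view difference behind those timings can be checked directly. This is a minimal sketch, assuming a smaller array than the one timed above so it runs instantly; np.shares_memory reports whether two arrays overlap in memory.

```python
import numpy as np

# Smaller illustrative array (an assumption; the timing example uses 100000 x 500)
array = np.random.random((1000, 50))
selected_rows = np.arange(0, 1000, 2)

fancy = array[selected_rows, :]  # advanced indexing: copies the selected rows
sliced = array[::2, :]           # basic slicing: a view onto the original buffer

print(np.shares_memory(array, fancy))   # the copy does not overlap the original
print(np.shares_memory(array, sliced))  # the view does
print(np.array_equal(fancy, sliced))    # both select the same rows
```

Both expressions select the same rows; only the advanced-indexing version has to allocate and fill a new array first.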
Unfortunately, there's no good way to represent all possible selected_rows as slices, so if you have a selected_rows that can't be represented as a slice, you don't have any other option but to take the hit in performance. There's more information in the answers to these questions:
dankal444's answer here doesn't help in your case, since the axis of the mean call is the axis you wanted to filter in the first place. It is, however, the best way to do this if the filter axis and the mean axis are different: save the creation of the new array until after you've condensed one axis. You still take a performance hit compared to basic slicing, but it is not as large as if you indexed before the mean call.
For example, if you wanted .mean(axis=1):
t1 = timeit.timeit("array[selected_rows, :].mean(axis=1)", globals=globals(), number=10)
t2 = timeit.timeit("array.mean(axis=1)[selected_rows]", globals=globals(), number=10)
t3 = timeit.timeit("array[::2, :].mean(axis=1)", globals=globals(), number=10)
t4 = timeit.timeit("array.mean(axis=1)[::2]", globals=globals(), number=10)
print(t1, t2, t3, t4)
# 1.4732236850004483 0.3643951010008095 0.21357544500006043 0.32832237200000236
This shows that:
- Indexing before mean is the worst by far (t1)
- Slicing before mean is best, since you don't have to spend extra time calculating means for the unnecessary rows (t3)
- Both indexing (t2) and slicing (t4) after mean are better than indexing before mean, but not better than slicing before mean
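Since slicing before mean is the fastest option, it may be worth checking whether a given index array happens to be evenly spaced, in which case it can be swapped for an equivalent slice. The sketch below is one way to do that; as_slice is a hypothetical helper written for this answer (not a NumPy function), and it assumes a 1-D array of non-negative, strictly increasing integers. Anything else falls back to advanced indexing.

```python
import numpy as np

def as_slice(indices):
    """Return an equivalent slice if `indices` is evenly spaced, else None.

    Hypothetical helper for illustration only. Handles 1-D arrays of
    non-negative, strictly increasing integers; returns None otherwise.
    """
    indices = np.asarray(indices)
    if indices.ndim != 1 or indices.size == 0:
        return None
    if not np.issubdtype(indices.dtype, np.integer) or indices[0] < 0:
        return None
    if indices.size == 1:
        start = int(indices[0])
        return slice(start, start + 1)
    steps = np.diff(indices)
    step = int(steps[0])
    if step <= 0 or np.any(steps != step):
        return None  # unevenly spaced or not increasing: no slice equivalent
    return slice(int(indices[0]), int(indices[-1]) + 1, step)

array = np.random.random((1000, 50))
selected_rows = np.arange(0, 1000, 2)

key = as_slice(selected_rows)
if key is None:
    key = selected_rows  # fall back to advanced indexing (copies the rows)
result = array[key, :].mean(axis=0)
```

The check itself costs one pass over the index array, so it only pays off when the indexed array is much larger than the index, as it is in the timings above.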