
I need to compute the mean of a 2D array along one axis. Here I keep all rows:

import numpy as np, time
x = np.random.random((100000, 500))

t0 = time.time()
y = x.mean(axis=0)       # y.shape is (500,) as expected
print(time.time() - t0)  # 36 milliseconds

When I filter and select only some rows, I notice it is 8 times slower. So I tried a simple test where selected_rows is in fact all rows. It is still 8 times slower:

selected_rows = np.arange(100000)
t0 = time.time()
y = x[selected_rows, :].mean(axis=0)        # selecting all rows!
print(time.time() - t0) # 280 milliseconds! (for the same result as above!)

Is there a way to speed up the process of selecting certain rows (selected_rows) and computing .mean(axis=0)?

In the specific case where selected_rows is all rows, it would be nice not to get an 8x slowdown.

Basj

3 Answers


(Sorry for first version of this answer)

The problem is the creation of the new array, which takes a lot of time compared to computing the mean itself.
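
You can see this by timing the two steps separately (a quick check of my own; the exact numbers will vary):

t0 = time.time()
xs = x[selected_rows, :]   # fancy indexing alone: allocates and fills a new 100000 x 500 array
print(time.time() - t0)

t0 = time.time()
xs.mean(axis=0)            # the mean itself, on the already-copied data
print(time.time() - t0)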

I tried to optimize the whole process using numba:

import numba
import numpy as np

@numba.jit('float64[:](float64[:, :], int32[:])')
def selective_mean(array, indices):
    total = np.zeros(array.shape[1], dtype=np.float64)  # accumulator for the selected rows
    for idx in indices:
        total += array[idx]
    return total / indices.size  # divide by the number of selected rows, not array.shape[0]

t0 = time.time()
y2 = selective_mean(x, selected_rows)
print(time.time() - t0) 

There is still a slowdown compared to the plain numpy mean, but a much smaller one (about 20% slower?). After compilation (the first call to this function) I got about the same timings. For a smaller indices array you should see some gain.
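
For example, averaging over a subset of the rows could look like this (the subset and its size are illustrative, not from the original answer; note the cast to match the int32[:] signature):

subset = np.random.choice(x.shape[0], 10000, replace=False).astype(np.int32)

selective_mean(x, subset)        # warm-up: the first call triggers JIT compilation
t0 = time.time()
y_sub = selective_mean(x, subset)
print(time.time() - t0)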

dankal444
  • Thank you for your answer. No, it is not a bug, I want `y` to be of shape `(500,)`, so it should be `mean(axis=0)`: I average over all rows. – Basj Nov 30 '22 at 17:30
  • @Basj sorry, I edited this answer and tried completely different approach – dankal444 Nov 30 '22 at 18:53
  • I think your original answer is also helpful. Your point about the new array being created is what triggered me to write about the difference between slicing and indexing, and led me to time the differences. IMO you should bring that back, and this version adds on to the first. – Pranav Hosangadi Dec 01 '22 at 01:19
  • Thanks @PranavHosangadi, I brought back information about new array being created. The rest was not relevant. – dankal444 Dec 01 '22 at 10:50

When you do x[selected_rows, :] where selected_rows is an array, it performs advanced (aka fancy) indexing to create a new array. This is what takes time.

If, instead, you use a slice, NumPy creates a view of the original array, which takes much less time. For example:

import timeit
import numpy as np

selected_rows = np.arange(0, 100000, 2)
array = np.random.random((100000, 500))

t1 = timeit.timeit("array[selected_rows, :].mean(axis=0)", globals=globals(), number=10)
t2 = timeit.timeit("array[::2, :].mean(axis=0)", globals=globals(), number=10)

print(t1, t2, t1 / t2) # 1.3985465039731935 0.18735826201736927 7.464557414839488
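
You can check the copy-vs-view behaviour directly (a quick sanity check, not part of the original timings):

a = np.arange(12).reshape(3, 4)
print(np.shares_memory(a, a[::2, :]))               # True: basic slicing returns a view
print(np.shares_memory(a, a[np.array([0, 2]), :]))  # False: fancy indexing copies the data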

Unfortunately, there's no way to represent every possible selected_rows as a slice, so if your selected_rows can't be expressed as a slice, you have no option but to take the performance hit. There's more information in the answers to related questions about NumPy's basic vs. advanced indexing.
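
As a small mitigation, when the indices happen to be evenly spaced you can detect that and fall back to a slice. The helper below is only a sketch of that idea (the name as_slice_or_index is mine, not from any answer):

def as_slice_or_index(indices):
    """Return an equivalent slice if `indices` is evenly spaced and increasing, else the array itself."""
    if indices.size < 2:
        return indices
    step = indices[1] - indices[0]
    if step > 0 and np.all(np.diff(indices) == step):
        return slice(indices[0], indices[-1] + 1, step)  # basic indexing: a view, no copy
    return indices  # not expressible as a slice: fancy indexing (and the copy) is unavoidable

y = array[as_slice_or_index(selected_rows), :].mean(axis=0)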

dankal444's original suggestion (taking the mean first and then indexing the result) doesn't help in your case, since the axis of the mean call is the axis you want to filter in the first place. It is, however, the best way to do this when the filter axis and the mean axis are different -- postpone the creation of the new array until after you've condensed one axis. You still take a performance hit compared to basic slicing, but it is not as large as indexing before the mean call.

For example, if you wanted .mean(axis=1):

t1 = timeit.timeit("array[selected_rows, :].mean(axis=1)", globals=globals(), number=10)
t2 = timeit.timeit("array.mean(axis=1)[selected_rows]", globals=globals(), number=10)
t3 = timeit.timeit("array[::2, :].mean(axis=1)", globals=globals(), number=10)
t4 = timeit.timeit("array.mean(axis=1)[::2]", globals=globals(), number=10)

print(t1, t2, t3, t4)
# 1.4732236850004483 0.3643951010008095 0.21357544500006043 0.32832237200000236

This shows that:

  • Indexing before mean is the worst by far (t1)
  • Slicing before mean is best, since you don't have to spend extra time calculating means for the unnecessary rows (t3)
  • Both indexing (t2) and slicing (t4) after mean are better than indexing before mean, but not better than slicing before mean
Pranav Hosangadi

AFAIK, it is not possible to do this efficiently in Numpy alone. The second code is slow because x[selected_rows, :] creates a new array (there is no way to create a view in the general case, as explained by @PranavHosangadi). This means a new buffer must be allocated and filled with data copied from x: the new buffer is read as well as written on x86-64 platforms due to the write-allocate cache policy, it triggers slow page faults in the system allocator, and x itself must also be read from memory. All of this is pretty expensive, not to mention that the mean function then reads the newly created buffer back from memory. Numpy also performs this fancy-indexing operation sub-optimally in its internal C code (due to the overhead of advanced iterators, the small size of the last axis and the number of possible cases to optimize in the Numpy code). A ufunc reduction could help here (they are designed for such a use case), but the resulting performance will be quite disappointing.
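
For instance, one way to read the "ufunc reduction" remark is a masked np.add.reduce (my interpretation, not code from this answer; it needs NumPy >= 1.17 for the `where` argument). It avoids materialising the copy, but it is indeed not fast:

# Boolean mask selecting the wanted rows (illustrative sketch).
mask = np.zeros(x.shape[0], dtype=bool)
mask[selected_rows] = True

# np.add has an identity (0), so the masked reduction needs no `initial` value.
y = np.add.reduce(x, axis=0, where=mask[:, None]) / selected_rows.size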

Numba and Cython can help to make this faster. @dankal444 provided a quite fast serial implementation using Numba. Here is a faster parallel chunk-based implementation (which is also a bit more complicated):

import numba as nb
import numpy as np

# Use '(float64[:,:], int64[:])' if `indices` is a 64-bit integer array.
# Use a list with both signatures to accept both types, at the expense of a longer compilation time.
@nb.jit('(float64[:,:], int32[:])', fastmath=True, parallel=True)
def indexedParallelMean(arr, indices):
    splitCount = 4
    l, m, n = arr.shape[0], arr.shape[1], indices.size
    res = np.zeros((splitCount, m))

    # Parallel reduction of each chunk
    for i in nb.prange(splitCount):
        start = n * i // splitCount
        end = n * (i + 1) // splitCount
        for j in range(start, end):
            res[i, :] += arr[indices[j], :]

    # Final sequential reduction
    for i in range(1, splitCount):
        res[0] += res[i]

    return res[0] / n
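
Usage mirrors the Numba answer above; selected_rows from the question is typically int64, so cast it (or use the int64 signature) to match the declaration. The call below is my own illustration, not part of the original answer:

selected_rows32 = selected_rows.astype(np.int32)   # match the int32[:] signature above
indexedParallelMean(x, selected_rows32)            # first call: JIT compilation
t0 = time.time()
y3 = indexedParallelMean(x, selected_rows32)
print(time.time() - t0)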

Here are the performance results on my machine with an i5-9600KF processor (6 cores):

Numpy initial version:         100 ms
Numba serial (of dankal444):    22 ms
Numba parallel (this answer):   11 ms

The computation is memory-bound: it saturates about 80% of the available RAM bandwidth, which is nearly optimal for high-level Numba/Cython code.

Jérôme Richard
  • I don't know if this is possible (question to the OP), but we could also use float32 instead of float64 for the input array. For "sane" `arr` values the quantization error would be minimized by the averaging. Converting `arr` to float32 would be too expensive, I guess. – dankal444 Nov 30 '22 at 20:25
  • Yes, using float32 numbers would certainly make this about 2x faster. For the mean, yes, it will be significantly more expensive (about 2x slower). Numpy uses a pair-wise algorithm for this rather than a two-pass algorithm, and it is generally enough. The thing is that a fast pair-wise mean code tends to be bigger and often a bit slower due to the lower locality (e.g. the one of Numpy). An alternative solution is a Kahan summation, which is more precise and rather simple to use here (but fastmath must not be enabled then). Still, it is more complex, and certainly not useful for float64, only for float32. – Jérôme Richard Nov 30 '22 at 20:38