This approach is inefficient because it iterates over nearest_neighbors many times, which is not needed. In the worst case, the runtime is quadratic in len(nearest_neighbors), which is really bad for large arrays.
A generic solution to this problem is to use a group-by strategy. Unfortunately, Numpy does not implement such a feature (it has been requested by many users for a long time), but Pandas does. With Numpy, you can compute it manually using a sort-based approach. Here is the idea:
# Sort the sample indices by neighbour id
index = np.argsort(nearest_neighbors)

# Unique neighbour ids and the start of each group in the sorted order
groups, offsets = np.unique(nearest_neighbors[index], return_index=True)

for i in range(groups.size):
    neighbor_id = groups[i]
    group_start = offsets[i]
    group_end = offsets[i+1] if i+1 < groups.size else index.size
    group_index = index[group_start:group_end]
    samples = data[group_index]
This solution runs in quasi-linear time (i.e. O(n log n)) as opposed to quadratic time (i.e. O(n²)).
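For instance, with dummy inputs (the values and shapes below are only an assumption, meant to illustrate the indexing), each iteration yields the rows of data belonging to one neighbour:

import numpy as np

# Dummy inputs: one neighbour id per row of `data` (assumption, for illustration only)
nearest_neighbors = np.array([2, 0, 2, 1, 0, 2])
data = np.arange(12, dtype=np.float64).reshape(6, 2)

index = np.argsort(nearest_neighbors)
groups, offsets = np.unique(nearest_neighbors[index], return_index=True)

for i in range(groups.size):
    group_end = offsets[i+1] if i+1 < groups.size else index.size
    samples = data[index[offsets[i]:group_end]]
    print(groups[i], samples.shape)  # prints: 0 (2, 2) / 1 (1, 2) / 2 (3, 2)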
Update: profiling & possible optimizations
First of all, using a recent version of Numpy tends to result in a faster execution (thanks to a better use of SIMD units). That being said, it turns out the implementation of data[group_index] is surprisingly inefficient on Windows compared to Linux. Indeed, here are the results on a machine with an i5-9600KF processor and 40 GiB/s RAM (Numpy 1.22.4 is used on both Linux and Windows):
Linux:
- Initial code: 77 ms
- Optimized code: 62 ms <---
Windows:
- Initial code: ~210 ms
- Optimized code: 166 ms <---
Optimal time: >= 15 ms
On Windows, a profiling analysis of the optimized implementation shows that ~80% of the time is spent in the memcpy function called by Numpy. Surprisingly, 15-20% of the time is spent in the free_base function, certainly caused by the many temporary Numpy arrays being freed. Overall, the RAM throughput is 12-16 GiB/s and most of the time is lost writing data to RAM while this is not required here.
On Linux, a profiling analysis of the optimized implementation shows that >80% of the time is spent in the Numpy function __memmove_avx_unaligned_erms executed by data[group_index]. More specifically, the biggest part of the time is spent in the following assembly loop:
%time | instructions
--------------------------------------------------
2,30 │1a0:┌─→vmovdqu (%rsi),%ymm1
23,30 │ │ vmovdqu 0x20(%rsi),%ymm2
3,26 │ │ vmovdqu 0x40(%rsi),%ymm3
31,26 │ │ vmovdqu 0x60(%rsi),%ymm4
0,53 │ │ sub $0xffffffffffffff80,%rsi
1,62 │ │ vmovdqa %ymm1,(%rdi)
4,57 │ │ vmovdqa %ymm2,0x20(%rdi)
2,79 │ │ vmovdqa %ymm3,0x40(%rdi)
2,33 │ │ vmovdqa %ymm4,0x60(%rdi)
0,48 │ │ sub $0xffffffffffffff80,%rdi
0,01 │ │ cmp %rdi,%rdx
0,39 │ └──ja 1a0
This loop is particularly efficient (although non-temporal instructions are not used): it reads data into 256-bit AVX SIMD registers and stores the result in the temporary array. Most of the time is spent reading data from memory. At first glance, the writes are not so expensive because the resulting array fits in the L3 cache on my machine.
About 15% of the time spent in this function lies in two other instructions, executed before this loop, that also load data from memory. These loads are a bit expensive because they access random lines of data.
Overall, the difference between Windows and Linux comes from the way Numpy is compiled. AFAIK, on Windows, the Microsoft compiler (MSVC) is used to build Numpy and it makes use of a slow memcpy function of the Microsoft standard library (CRT), while on Linux, GCC generates a relatively good SIMD-based loop as opposed to a call to the memcpy of the glibc. Numpy is not really responsible for the slowdown: the Microsoft MSVC/CRT is the main issue here.
In practice, Numpy cannot saturate the RAM in sequential using only 1 core because of the way my CPU is designed. It can only read data from the RAM at 28 GiB/s sequentially, while it reaches 38-40 GiB/s in parallel. This gap is bigger on server-side computing nodes, so using multiple cores should help. That being said, this solution may not be possible depending on what you do with samples in the loop.
Moreover, writing data into samples while reading data from the RAM tends to cause cache misses on my machine, resulting in samples being written back to RAM (slower than the L3). This problem can be fixed by computing the lines of data on the fly (so nothing is written to the L3 cache, only the L1 cache). Numba and Cython can be used to do that efficiently. However, this may not be possible depending on what you do with samples in the loop. This is the best optimization on my machine: combined with the previous one, it results in an 18 ms execution time. Surprisingly, using only multiple threads (with Numba) does not make the execution faster, apparently because it causes more L3 cache misses, resulting in more data being written back to RAM (which is already saturated).
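As an illustration, here is a minimal sketch of this on-the-fly strategy with Numba. It assumes the per-group computation can be expressed incrementally; a simple per-group sum is used here as a stand-in for the real computation (this is an assumption, not the actual operation of the question):

import numpy as np
import numba as nb

@nb.njit
def per_group_sum(data, index, offsets):
    # Accumulate each group on the fly instead of materializing `samples`,
    # so no temporary array is written back to the L3 cache/RAM.
    result = np.zeros((offsets.size, data.shape[1]))
    for i in range(offsets.size):
        group_start = offsets[i]
        group_end = offsets[i+1] if i+1 < offsets.size else index.size
        for j in range(group_start, group_end):
            row = index[j]
            for k in range(data.shape[1]):
                result[i, k] += data[row, k]
    return result

# `nearest_neighbors` and `data` are the arrays of the question.
# The group-by preparation is still done with plain Numpy (see the note below).
index = np.argsort(nearest_neighbors)
groups, offsets = np.unique(nearest_neighbors[index], return_index=True)
group_sums = per_group_sum(data, index, offsets)

The accumulation happens in registers and the L1 cache for each row, which is what avoids the extra round-trip through the L3 cache/RAM described above.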
There is not much more to do to speed up this code sequentially: writing data to the L3 cache sequentially also takes 11 ms on my machine (so the best execution time of this method is 26 ms).
Note that the np.unique call with the above parameters is apparently not yet supported by Numba, so it must be done with Numpy in a separate function. That being said, one can compute a group-by using a bucket strategy (or a hash map, depending on the real-world input numbers). This method can be combined with the on-the-fly computation of samples, assuming the computation using samples can be done incrementally.
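Here is a minimal sketch of such a bucket-based group-by written with Numba (it assumes the neighbour ids are small non-negative integers, which may not hold for your real-world data):

import numpy as np
import numba as nb

@nb.njit
def bucket_group_by(nearest_neighbors, n_buckets):
    # Count how many samples fall into each bucket (i.e. each neighbour id)
    counts = np.zeros(n_buckets, dtype=np.int64)
    for v in nearest_neighbors:
        counts[v] += 1
    # Exclusive prefix sum: start offset of each bucket in `index`
    offsets = np.zeros(n_buckets + 1, dtype=np.int64)
    for i in range(n_buckets):
        offsets[i+1] = offsets[i] + counts[i]
    # Scatter the sample indices into their bucket
    index = np.empty(nearest_neighbors.size, dtype=np.int64)
    fill = offsets[:n_buckets].copy()
    for i in range(nearest_neighbors.size):
        v = nearest_neighbors[i]
        index[fill[v]] = i
        fill[v] += 1
    return index, offsets

# The rows of group g are then data[index[offsets[g]:offsets[g+1]]]
index, offsets = bucket_group_by(nearest_neighbors, nearest_neighbors.max() + 1)

This version does not need np.unique, so it can run entirely inside Numba and be combined with the on-the-fly accumulation shown above.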
On Windows, using quite-naive Numba code can be a good solution to overcome the inefficient Numpy implementation and so improve the execution time significantly.
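For instance, here is a sketch of such a naive gather written with Numba (float64 data is an assumption here), which can replace the data[group_index] fancy indexing and thus avoid the slow CRT memcpy:

import numpy as np
import numba as nb

@nb.njit
def gather_rows(data, group_index):
    # Naive row-by-row copy: Numba generates its own copy loop
    # instead of calling the slow memcpy of the Microsoft CRT.
    out = np.empty((group_index.size, data.shape[1]), dtype=np.float64)  # float64 assumed
    for i in range(group_index.size):
        row = group_index[i]
        for k in range(data.shape[1]):
            out[i, k] = data[row, k]
    return out

# Drop-in replacement for `samples = data[group_index]` in the loop above:
# samples = gather_rows(data, group_index)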
Put it shortly, it may or may not be possible to perform this operation faster depending on the exact target platform and the exact computation actually performed in your real-world code.