This approach is inefficient because it iterates over nearest_neighbors many times, which is not needed. In the worst case, the runtime is quadratic in len(nearest_neighbors), which is really bad for large arrays.
A generic solution to this problem is to use a group-by strategy. Unfortunately, Numpy does not implement such a feature (it has been requested by many users for a long time), but Pandas does. With Numpy, you can compute it manually using a sort-based approach. Here is the idea:
# Sort the sample indices by neighbour id
index = np.argsort(nearest_neighbors)

# Unique neighbour ids and the start of each group in the sorted order
groups, offsets = np.unique(nearest_neighbors[index], return_index=True)

for i in range(groups.size):
    neighbor_id = groups[i]
    group_start = offsets[i]
    group_end = offsets[i+1] if i+1 < groups.size else index.size
    group_index = index[group_start:group_end]
    samples = data[group_index]
This solution runs in quasi-linear time (i.e. O(n log n)) as opposed to quadratic time (i.e. O(n²)).
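For instance, with dummy inputs (the values and shapes below are only an assumption, meant to illustrate the indexing), each iteration yields the rows of data belonging to one neighbour:

import numpy as np

# Dummy inputs: one neighbour id per row of `data` (assumption, for illustration only)
nearest_neighbors = np.array([2, 0, 2, 1, 0, 2])
data = np.arange(12, dtype=np.float64).reshape(6, 2)

index = np.argsort(nearest_neighbors)
groups, offsets = np.unique(nearest_neighbors[index], return_index=True)

for i in range(groups.size):
    group_end = offsets[i+1] if i+1 < groups.size else index.size
    samples = data[index[offsets[i]:group_end]]
    print(groups[i], samples.shape)  # prints: 0 (2, 2) / 1 (1, 2) / 2 (3, 2)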
Update: profiling & possible optimizations
First of all, using a recent version of Numpy tends to result in a faster execution (thanks to a better use of SIMD units). That being said, it turns out the implementation of data[group_index] is surprisingly inefficient on Windows compared to Linux. Indeed, here are the results on a machine with an i5-9600KF processor and 40 GiB/s RAM (Numpy 1.22.4 is used on both Linux and Windows):
Linux:
- Initial code: 77 ms
- Optimized code: 62 ms <---
Windows:
- Initial code: ~210 ms
- Optimized code: 166 ms <---
Optimal time: >= 15 ms
On Windows, a profiling analysis of the optimized implementation shows that ~80% of the time is spent in the memcpy function called by Numpy. Surprisingly, 15-20% of the time is spent in the free_base function, certainly caused by the many temporary Numpy arrays being freed. Overall, the RAM throughput is 12-16 GiB/s and most of the time is lost writing data to RAM while this is not required here.
On Linux, a profiling analysis of the optimized implementation shows that >80% of the time is spent in the Numpy function __memmove_avx_unaligned_erms executed by data[group_index]. More specifically, the biggest part of the time is spent in the following assembly loop:
%time | instructions
--------------------------------------------------
2,30 │1a0:┌─→vmovdqu (%rsi),%ymm1
23,30 │ │ vmovdqu 0x20(%rsi),%ymm2
3,26 │ │ vmovdqu 0x40(%rsi),%ymm3
31,26 │ │ vmovdqu 0x60(%rsi),%ymm4
0,53 │ │ sub $0xffffffffffffff80,%rsi
1,62 │ │ vmovdqa %ymm1,(%rdi)
4,57 │ │ vmovdqa %ymm2,0x20(%rdi)
2,79 │ │ vmovdqa %ymm3,0x40(%rdi)
2,33 │ │ vmovdqa %ymm4,0x60(%rdi)
0,48 │ │ sub $0xffffffffffffff80,%rdi
0,01 │ │ cmp %rdi,%rdx
0,39 │ └──ja 1a0
This loop is particularly efficient (although non-temporal instructions are not used): it reads data into 256-bit AVX SIMD registers and stores the result in the temporary array. Most of the time is spent reading data from memory. At first glance, the writes are not so expensive because the resulting array fits in the L3 cache on my machine.
About 15% of the time spent in this function lies in two other instructions, executed before this loop, that also load data from memory. These loads are a bit expensive because they access random lines of data.
Overall, the difference between Windows and Linux comes from the way Numpy is compiled. AFAIK, on Windows, the Microsoft compiler (MSVC) is used to build Numpy and it makes use of a slow memcpy function of the Microsoft standard library (CRT), while on Linux, GCC generates a relatively good SIMD-based loop as opposed to a call to the memcpy of the glibc. Numpy is not really responsible for the slowdown: the Microsoft MSVC/CRT is the main issue here.
In practice, Numpy cannot saturate the RAM in sequential using only 1 core because of the way my CPU is designed. It can only read data from the RAM at 28 GiB/s sequentially, while it reaches 38-40 GiB/s in parallel. This gap is bigger on server-side computing nodes, so using multiple cores should help. That being said, this solution may not be possible depending on what you do with samples in the loop.
Moreover, writing data into samples while reading data from the RAM tends to cause cache misses on my machine, resulting in samples being written back to RAM (slower than the L3). This problem can be fixed by computing the lines of data on the fly (so nothing is written to the L3 cache, only the L1 cache). Numba and Cython can be used to do that efficiently. However, this may not be possible depending on what you do with samples in the loop. This is the best optimization on my machine: combined with the previous one, it results in an 18 ms execution time. Surprisingly, using only multiple threads (with Numba) does not make the execution faster, apparently because it causes more L3 cache misses, resulting in more data being written back to RAM (which is already saturated).
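As an illustration, here is a minimal sketch of this on-the-fly strategy with Numba. It assumes the per-group computation can be expressed incrementally; a simple per-group sum is used here as a stand-in for the real computation (this is an assumption, not the actual operation of the question):

import numpy as np
import numba as nb

@nb.njit
def per_group_sum(data, index, offsets):
    # Accumulate each group on the fly instead of materializing `samples`,
    # so no temporary array is written back to the L3 cache/RAM.
    result = np.zeros((offsets.size, data.shape[1]))
    for i in range(offsets.size):
        group_start = offsets[i]
        group_end = offsets[i+1] if i+1 < offsets.size else index.size
        for j in range(group_start, group_end):
            row = index[j]
            for k in range(data.shape[1]):
                result[i, k] += data[row, k]
    return result

# `nearest_neighbors` and `data` are the arrays of the question.
# The group-by preparation is still done with plain Numpy (see the note below).
index = np.argsort(nearest_neighbors)
groups, offsets = np.unique(nearest_neighbors[index], return_index=True)
group_sums = per_group_sum(data, index, offsets)

The accumulation happens in registers and the L1 cache for each row, which is what avoids the extra round-trip through the L3 cache/RAM described above.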
There is not much more to do to speed up this code sequentially: writing data to the L3 cache sequentially also takes 11 ms on my machine (so the best execution time of this method is 26 ms).
Note that the np.unique call with the above parameters is apparently not yet supported by Numba, so it must be done with Numpy in a separate function. That being said, one can compute a group-by using a bucket strategy (or a hash map, depending on the real-world input numbers). This method can be combined with the on-the-fly computation of samples, assuming the computation using samples can be done incrementally.
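Here is a minimal sketch of such a bucket-based group-by written with Numba (it assumes the neighbour ids are small non-negative integers, which may not hold for your real-world data):

import numpy as np
import numba as nb

@nb.njit
def bucket_group_by(nearest_neighbors, n_buckets):
    # Count how many samples fall into each bucket (i.e. each neighbour id)
    counts = np.zeros(n_buckets, dtype=np.int64)
    for v in nearest_neighbors:
        counts[v] += 1
    # Exclusive prefix sum: start offset of each bucket in `index`
    offsets = np.zeros(n_buckets + 1, dtype=np.int64)
    for i in range(n_buckets):
        offsets[i+1] = offsets[i] + counts[i]
    # Scatter the sample indices into their bucket
    index = np.empty(nearest_neighbors.size, dtype=np.int64)
    fill = offsets[:n_buckets].copy()
    for i in range(nearest_neighbors.size):
        v = nearest_neighbors[i]
        index[fill[v]] = i
        fill[v] += 1
    return index, offsets

# The rows of group g are then data[index[offsets[g]:offsets[g+1]]]
index, offsets = bucket_group_by(nearest_neighbors, nearest_neighbors.max() + 1)

This version does not need np.unique, so it can run entirely inside Numba and be combined with the on-the-fly accumulation shown above.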
On Windows, using quite-naive Numba code can be a good solution to overcome the inefficient Numpy implementation and so improve the execution time significantly.
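For instance, here is a sketch of such a naive gather written with Numba (float64 data is an assumption here), which can replace the data[group_index] fancy indexing and thus avoid the slow CRT memcpy:

import numpy as np
import numba as nb

@nb.njit
def gather_rows(data, group_index):
    # Naive row-by-row copy: Numba generates its own copy loop
    # instead of calling the slow memcpy of the Microsoft CRT.
    out = np.empty((group_index.size, data.shape[1]), dtype=np.float64)  # float64 assumed
    for i in range(group_index.size):
        row = group_index[i]
        for k in range(data.shape[1]):
            out[i, k] = data[row, k]
    return out

# Drop-in replacement for `samples = data[group_index]` in the loop above:
# samples = gather_rows(data, group_index)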
Put it shortly, it may or may not be possible to perform this operation faster depending on the exact target platform and the exact computation actually performed in your real-world code.