12

I am working on a spatial search problem for spheres in which I want to find connected spheres. To do so, I search around each sphere for spheres whose centers lie within a distance of (maximum sphere diameter) from the searching sphere's center. At first I tried SciPy-related methods, but the SciPy approach takes longer than the equivalent NumPy method. For SciPy, I first determine the number of K-nearest spheres and then find them with cKDTree.query, which adds extra time; however, it is slower than the NumPy method even when that first step is skipped and replaced by a constant value (which is not really acceptable in this case). This is contrary to my expectations about SciPy's spatial-search speed. So, I tried replacing some NumPy lines with list-based loops in order to speed things up with numba prange. Numba runs the code a little faster, but I believe this code can be optimized further, perhaps by vectorization, by using other alternative NumPy modules, or by using Numba in another way. I iterate over all spheres to prevent probable memory leaks and …, where the number of spheres is high.

import numpy as np
import numba as nb
from scipy.spatial import cKDTree, distance

# ---------------------------- input data ----------------------------
""" For testing by prepared files:
radii = np.load('a.npy')     # shape: (n-spheres, )     must be loaded by np.load('a.npy') or np.loadtxt('radii_large.csv')
poss = np.load('b.npy')      # shape: (n-spheres, 3)    must be loaded by np.load('b.npy') or np.loadtxt('pos_large.csv', delimiter=',')
"""

rnd = np.random.RandomState(70)
data_volume = 200000

radii = rnd.uniform(0.0005, 0.122, data_volume)
dia_max = 2 * radii.max()

x = rnd.uniform(-1.02, 1.02, (data_volume, 1))
y = rnd.uniform(-3.52, 3.52, (data_volume, 1))
z = rnd.uniform(-1.02, -0.575, (data_volume, 1))
poss = np.hstack((x, y, z))
# --------------------------------------------------------------------

# @nb.jit('float64[:,::1](float64[:,::1], float64[::1])', forceobj=True, parallel=True)
def ends_gap(poss, dia_max):
    particle_corsp_overlaps = np.array([], dtype=np.float64)
    ends_ind = np.empty([1, 2], dtype=np.int64)
    """ using list looping """
    # particle_corsp_overlaps = []
    # ends_ind = []

    # for particle_idx in nb.prange(len(poss)):  # by list looping
    for particle_idx in range(len(poss)):
        unshared_idx = np.delete(np.arange(len(poss)), particle_idx)                                                    # <--- relatively high time consumer
        poss_without = poss[unshared_idx]

        """ # SCIPY method ---------------------------------------------------------------------------------------------
        nears_i_ind = cKDTree(poss_without).query_ball_point(poss[particle_idx], r=dia_max)         # <--- high time consumer
        if len(nears_i_ind) > 0:
            dist_i, dist_i_ind = cKDTree(poss_without[nears_i_ind]).query(poss[particle_idx], k=len(nears_i_ind))       # <--- high time consumer
            if not isinstance(dist_i, float):
                dist_i[dist_i_ind] = dist_i.copy()
        """  # NUMPY method --------------------------------------------------------------------------------------------
        lx_limit_idx = poss_without[:, 0] <= poss[particle_idx][0] + dia_max
        ux_limit_idx = poss_without[:, 0] >= poss[particle_idx][0] - dia_max
        ly_limit_idx = poss_without[:, 1] <= poss[particle_idx][1] + dia_max
        uy_limit_idx = poss_without[:, 1] >= poss[particle_idx][1] - dia_max
        lz_limit_idx = poss_without[:, 2] <= poss[particle_idx][2] + dia_max
        uz_limit_idx = poss_without[:, 2] >= poss[particle_idx][2] - dia_max

        nears_i_ind = np.where(lx_limit_idx & ux_limit_idx & ly_limit_idx & uy_limit_idx & lz_limit_idx & uz_limit_idx)[0]
        if len(nears_i_ind) > 0:
            dist_i = distance.cdist(poss_without[nears_i_ind], poss[particle_idx][None, :]).squeeze()                   # <--- relatively high time consumer
        # """  # -------------------------------------------------------------------------------------------------------
            contact_check = dist_i - (radii[unshared_idx][nears_i_ind] + radii[particle_idx])
            connected = contact_check[contact_check <= 0]

            particle_corsp_overlaps = np.concatenate((particle_corsp_overlaps, connected))
            """ using list looping """
            # if len(connected) > 0:
            #    for value_ in connected:
            #        particle_corsp_overlaps.append(value_)

            contacts_ind = np.where([contact_check <= 0])[1]
            contacts_sec_ind = np.array(nears_i_ind)[contacts_ind]
            sphere_olps_ind = np.where((poss[:, None] == poss_without[contacts_sec_ind][None, :]).all(axis=2))[0]       # <--- high time consumer

            ends_ind_mod_temp = np.array([np.repeat(particle_idx, len(sphere_olps_ind)), sphere_olps_ind], dtype=np.int64).T
            if particle_idx > 0:
                ends_ind = np.concatenate((ends_ind, ends_ind_mod_temp))
            else:
                ends_ind[0, 0], ends_ind[0, 1] = ends_ind_mod_temp[0, 0], ends_ind_mod_temp[0, 1]
            """ using list looping """
            # for contacted_idx in sphere_olps_ind:
            #    ends_ind.append([particle_idx, contacted_idx])

    # ends_ind_org = np.array(ends_ind)  # using lists
    ends_ind_org = ends_ind
    ends_ind, ends_ind_idx = np.unique(np.sort(ends_ind_org), axis=0, return_index=True)                                # <--- relatively high time consumer
    gap = np.array(particle_corsp_overlaps)[ends_ind_idx]
    return gap, ends_ind, ends_ind_idx, ends_ind_org

In one of my tests on 23,000 spheres, the SciPy, NumPy, and Numba-aided methods finished the loop in about 400, 200, and 180 seconds respectively on a Colab TPU runtime; for 500,000 spheres it takes about 3.5 hours. These execution times are not satisfying at all for my project, where the number of spheres may be up to 1,000,000 for a medium data volume. I will call this code many times from my main code and am looking for ways to run it in milliseconds (or as fast as possible). Is that possible? I would appreciate it if anyone could speed up the code as needed.

Notes:

  • The code must be executable with Python 3.7+, on both CPU and GPU.
  • It must handle data sizes of at least 300,000 spheres.
  • Any NumPy, SciPy, or … equivalent modules that replace my hand-written parts and make the code significantly faster will be upvoted.

I would appreciate any recommendations or explanations about:

  1. Which method could be faster for this problem?
  2. Why is SciPy not faster than the other methods in this case, and where could it be helpful for this kind of problem?
  3. Choosing between iterative methods and matrix-form methods is confusing for me. Iterative methods use less memory and can be tuned with Numba and …, but I think they are not comparable with matrix methods (which are limited by memory) like NumPy and … for huge numbers of spheres. For this case, perhaps I could drop the iteration in favor of pure NumPy, but I strongly suspect it could not be handled due to the huge matrix operations and memory issues.

Prepared sample test data:

Poss data: 23000, 500000
Radii data: 23000, 500000
Line-by-line speed test logs for the two test cases: scipy method and numpy method time consumption.

Ali_Sh
  • Note that `dist_i` is not set if `len(nears_i_ind)` is 0 in the Scipy implementation and thus it takes the value from previous iterations. I guess this should not happen since the Numpy implementation recompute it but the remaining code does not appear to support empty arrays. – Jérôme Richard Feb 15 '22 at 01:07
  • Just for a little more explanation: I used `if particle_idx > 0, else` for where `np.empty` is updating, because in my cases, spheres have overlaps **from the beginning**. It could be written in a more comprehensive form that can handle models with free spheres. I used `if len(nears_i_ind) > 0` for two reasons: 1. Preventing probable `cKDTree.query` error for free spheres when `k=0` and 2. Preventing additional calculations if that get no error (But it will I think). `nears_i_ind ` converging to `0` in my main code. – Ali_Sh Feb 15 '22 at 13:23
  • @JérômeRichard I put two `#` under scipy part (before update the question) which I meant the following code lines must be moved forward by *tab* for scipy method, so it doesn't take the previous values (however, I have placed scipy here just for evaluating its performance). Now, I update it for numpy. I used `np.empty` to replace commented list operations by numpy like codes (which can be written by other numpy modules in another better form, I wrote the code with what came to my mind; here it seems lists will perform better). – Ali_Sh Feb 15 '22 at 15:02
  • See also [this related answer](https://stackoverflow.com/a/45903853/980270). If there is a chance that you can interface with Java or C++ then using a different tree that works well with large datasets (R-Tree, PH-Tree (disclaimer: self-advertisement)) may be a good idea. – TilmannZ Feb 16 '22 at 11:38
  • @TilmannZ Thanks for the suggestions, No, just Python. I have read your answer on the link (+1). I have tried `R_tree` using one of python libraries (I don't remember it's name, I think it was, just, implementing `R_tree`s) that was emphasized on some SO issues by others to be more satisfying than scipy, but it takes more time comparing to scipy on my example cases, which was contrary to my expectations and studies about its performance, and I returned to my scipy code. – Ali_Sh Feb 16 '22 at 21:21

5 Answers

7

UPDATE: this answer is now superseded by this new one (which takes the updates of the question into account), providing even faster code based on a different approach.


Step 1: better algorithm

First of all, building a k-d tree runs in O(n log n) time and a query runs in O(log n) time, where n is the number of points, so using a k-d tree seems like a good idea at first glance. However, your code builds a k-d tree for each point, resulting in O(n² log n) time. This is why the SciPy solution is slower than the others. SciPy does not provide a way to update a k-d tree, and it turns out that efficiently updating a k-d tree appears not to be possible anyway. Fortunately, this is not a problem in your case: you can build one k-d tree with all the points once and then simply discard the current point from the result of each query.

Moreover, the computation of sphere_olps_ind runs in O(n² m) time, where n is the total number of points and m is the average number of neighbours (i.e. closest points retrieved from the k-d tree query). Assuming there are no duplicate points, sphere_olps_ind is simply equal to np.sort(contacts_sec_ind). The latter runs in O(m log m), which is drastically better.
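
To make this equivalence concrete, here is a tiny self-contained check (illustrative only, not part of the original code); it shows that matching rows of poss against a subset by value, as the original sphere_olps_ind line does, merely recovers the subset's indices in sorted order once those indices already refer to the full poss array and the points are unique:

import numpy as np

rng = np.random.default_rng(0)
poss = rng.random((10, 3))                 # unique points
contacts_sec_ind = np.array([7, 2, 5])     # neighbour indices in full `poss` numbering
subset = poss[contacts_sec_ind]

# O(n*m) value matching, as in the original code
by_value = np.where((poss[:, None] == subset[None, :]).all(axis=2))[0]
assert np.array_equal(by_value, np.sort(contacts_sec_ind))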

Additionally, using np.concatenate in a loop to append values to a NumPy array is slow because it creates a new, bigger array on each iteration. Using a list was a good idea, but appending NumPy arrays directly to a list and then calling np.concatenate once at the end is much faster.
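
As a minimal illustration of that pattern (placeholder data, not the answer's code):

import numpy as np

chunks = []                                # grow a cheap Python list of arrays...
for i in range(1000):
    chunks.append(np.arange(i % 5 + 1))    # per-iteration result (placeholder)
result = np.concatenate(chunks)            # ...and pay for a single copy at the end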

Here is the resulting code:

def ends_gap(poss, dia_max):
    particle_corsp_overlaps = []
    ends_ind = [np.empty([1, 2], dtype=np.int64)]

    kdtree = cKDTree(poss)

    for particle_idx in range(len(poss)):
        # Find the nearest point including the current one and
        # then remove the current point from the output.
        # The distances can be computed directly without a new query.
        cur_point = poss[particle_idx]
        nears_i_ind = np.array(kdtree.query_ball_point(cur_point, r=dia_max), dtype=np.int64)
        assert len(nears_i_ind) > 0

        if len(nears_i_ind) <= 1:
            continue

        nears_i_ind = nears_i_ind[nears_i_ind != particle_idx]
        dist_i = distance.cdist(poss[nears_i_ind], cur_point[None, :]).squeeze()

        contact_check = dist_i - (radii[nears_i_ind] + radii[particle_idx])
        connected = contact_check[contact_check <= 0]

        particle_corsp_overlaps.append(connected)

        contacts_ind = np.where([contact_check <= 0])[1]
        contacts_sec_ind = nears_i_ind[contacts_ind]
        sphere_olps_ind = np.sort(contacts_sec_ind)

        ends_ind_mod_temp = np.array([np.repeat(particle_idx, len(sphere_olps_ind)), sphere_olps_ind], dtype=np.int64).T
        if particle_idx > 0:
            ends_ind.append(ends_ind_mod_temp)
        else:
            ends_ind[0][:] = ends_ind_mod_temp[0, 0], ends_ind_mod_temp[0, 1]

    ends_ind_org = np.concatenate(ends_ind)
    ends_ind, ends_ind_idx = np.unique(np.sort(ends_ind_org), axis=0, return_index=True)                                # <--- relatively high time consumer
    gap = np.concatenate(particle_corsp_overlaps)[ends_ind_idx]
    return gap, ends_ind, ends_ind_idx, ends_ind_org

Step 2: optimization

First of all, the query_ball_point call can be done on all the points at once, in parallel, by passing poss to the SciPy method and specifying workers=-1. However, note that this requires more memory.
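
A minimal sketch of that batched call, reusing poss and dia_max from the question (assuming SciPy >= 1.6, where query_ball_point accepts an array of query points and a workers argument):

from scipy.spatial import cKDTree

kdtree = cKDTree(poss)
# One call returns a list of neighbour-index lists (one per point), computed in parallel.
all_neighbours = kdtree.query_ball_point(poss, r=dia_max, workers=-1)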

Moreover, Numba can be used to significantly speed up the computation. The parts that can mainly be improved are the computation of the distances, the creation of many unnecessary temporary arrays, and the use of direct NumPy array indexing instead of list appends (since an upper bound on the output array size is known after the query_ball_point call).

Here is a simple example of optimized code using Numba:

@nb.jit('(float64[:, ::1], int64[::1], int64[::1], float64)')
def compute(poss, all_neighbours, all_neighbours_sizes, dia_max):
    particle_corsp_overlaps = []
    ends_ind_lst = [np.empty((1, 2), dtype=np.int64)]
    an_offset = 0

    for particle_idx in range(len(poss)):
        cur_point = poss[particle_idx]
        cur_len = all_neighbours_sizes[particle_idx]
        nears_i_ind = all_neighbours[an_offset:an_offset+cur_len]
        an_offset += cur_len
        assert len(nears_i_ind) > 0

        if len(nears_i_ind) <= 1:
            continue

        nears_i_ind = nears_i_ind[nears_i_ind != particle_idx]
        dist_i = np.empty(len(nears_i_ind), dtype=np.float64)

        # Compute the distances
        x1, y1, z1 = poss[particle_idx]
        for i in range(len(nears_i_ind)):
            x2, y2, z2 = poss[nears_i_ind[i]]
            dist_i[i] = np.sqrt((x2-x1)**2 + (y2-y1)**2 + (z2-z1)**2)

        contact_check = dist_i - (radii[nears_i_ind] + radii[particle_idx])
        connected = contact_check[contact_check <= 0]

        particle_corsp_overlaps.append(connected)

        contacts_ind = np.where(contact_check <= 0)
        contacts_sec_ind = nears_i_ind[contacts_ind]
        sphere_olps_ind = np.sort(contacts_sec_ind)

        ends_ind_mod_temp = np.empty((len(sphere_olps_ind), 2), dtype=np.int64)
        for i in range(len(sphere_olps_ind)):
            ends_ind_mod_temp[i, 0] = particle_idx
            ends_ind_mod_temp[i, 1] = sphere_olps_ind[i]

        if particle_idx > 0:
            ends_ind_lst.append(ends_ind_mod_temp)
        else:
            tmp = ends_ind_lst[0]
            tmp[:] = ends_ind_mod_temp[0, :]

    return particle_corsp_overlaps, ends_ind_lst

def ends_gap(poss, dia_max):
    kdtree = cKDTree(poss)
    tmp = kdtree.query_ball_point(poss, r=dia_max, workers=-1)
    all_neighbours = np.concatenate(tmp, dtype=np.int64)
    all_neighbours_sizes = np.array([len(e) for e in tmp], dtype=np.int64)
    particle_corsp_overlaps, ends_ind_lst = compute(poss, all_neighbours, all_neighbours_sizes, dia_max)
    ends_ind_org = np.concatenate(ends_ind_lst)
    ends_ind, ends_ind_idx = np.unique(np.sort(ends_ind_org), axis=0, return_index=True)
    gap = np.concatenate(particle_corsp_overlaps)[ends_ind_idx]
    return gap, ends_ind, ends_ind_idx, ends_ind_org

ends_gap(poss, dia_max)

Performance analysis

Here are the performance results on my 6-core machine (with an i5-9600KF processor) on the small dataset:

Initial code with Scipy:             259 s
Initial default code with Numpy:     112 s
Optimized algorithm:                   1.37 s
Final optimized code:                  0.22 s

Unfortunately, the Scipy k-d tree is too big to fit in memory on my machine with the big dataset.

Thus, the Numba implementation with an efficient algorithm is up to ~510 times faster than the initial NumPy implementation and ~1200 times faster than the initial SciPy implementation.

The Numba code can be further optimized, but please note that the Numba compute call takes less than 25% of the time on my machine. The np.unique call is the most expensive, but it is not easy to make it faster. A significant part of the time is spent in the SciPy-to-Numba data conversion, but this code is mandatory as long as SciPy is used. Thus, the code can be improved a bit (e.g. certainly 2x faster) with advanced Numba optimization, but if you need much faster code, then you need to use a native language like C++ and a highly-optimized parallel k-d tree implementation. I expect very optimized native code to be an order of magnitude faster, but not much more. I hardly believe the big dataset can be computed in less than 10 ms on my machine regardless of the implementation.


Notes

Note that gap differs between the provided functions (the other outputs are left unchanged). However, the same thing happens between the initial SciPy method and the NumPy one. This appears to come from the ordering of variables like nears_i_ind and dist_i, which is undefined by SciPy and changes the gap result in a non-trivial way (not just the order of gap). I am not sure whether this is a problem of the initial implementation. Because of this, it is much harder to compare the correctness of the different implementations.

Also, forceobj should not be used in production, as the documentation states it is only useful for testing purposes.

Jérôme Richard
  • Thanks a lot, It was very helpful. I test on some examples and it get same results as mine. For `gap`, I have seen this problem before when I was trying to implement `Scipy` in balanced and unbalanced forms and also `R_tree`. Finally, I think I find the solution, use `return_sorted=True` for `query_ball_point`; let me know if it is not the solution. It seems, `cur_point` and `dia_max` are unnecessary in `compute` function. – Ali_Sh Feb 16 '22 at 20:30
  • @Ali_Sh Using `return_sorted=True` help to generate more reproducible results but I find weird the fact that index ordering can change the result (and not just its order). So I guess the problem comes from the algorithm itself. Good catch for `cur_point` and `dia_max`. Thank you. – Jérôme Richard Feb 17 '22 at 11:46
  • @Ali_Sh As for the Numba code, it can be indeed improved but it does not take a big portion of the time. Actually, I do not expect much more than a 2x speed up including the optimization of the `unique`+`sort` calls which is certainly a bit tough. If you are interested in such an optimization, then I can try what I can do to optimize the code. – Jérôme Richard Feb 17 '22 at 11:51
  • I have tested `return_sorted=True` on several cases. Did you test it and get wrong results again? What was your system *ram* configuration? I have tested your codes on *23.000* data and it shows good differences in runtimes. On my system with *i5 1st gen cpu, 16gb ram* data volumes more than 200.000 (errors for [workers=1](https://drive.google.com/file/d/1Y2-YsmCWWpkUxQCOrb_IIse7sTfpvQiR/view?usp=sharing) and [workers=-1](https://drive.google.com/file/d/15GNLYxQmkPBEAq2-LzUKHhJ3oNucYKE6/view?usp=sharing)) and on *colab TPU* more than 100.000 will stuck for `numba` method due to ram limits. – Ali_Sh Feb 17 '22 at 17:34
  • For your *optimized algorithm*, *numba with workers -1*, and *numba with workers 1*, the resulted time for 100,000 data were 45, 22, and 33 sec (~70% ram) on my system and 30, 25, and 29 sec on Colab, respectively, and for 200,000 data volume it takes 101, 56, and 78 sec (~95% ram) on my system. What is the cause of this relatively high difference between your resulted time and mine? and Could ram consumption be controlled by numba method? Numba method needs much more rams comparing your *optimized algorithm* or my numpy method. Based on these ram consumption tests, your system ram must be 64? – Ali_Sh Feb 17 '22 at 17:45
  • Did you test by my prepared data sets i.e. *23.000* and *500.000* or you prepared your own data sets? When I am using *23000* data set, it takes 3.5, 0.5, 0.8 secs which are quiet fast, but when data volume get larger 4 times (e.g. *100.000*), the consuming time increase significantly as I said. It is very strange. Do you know the reason for this? Could it be related to *ram* again???? – Ali_Sh Feb 17 '22 at 18:21
  • I used your datasets (23.000 and 500.000). I did not understand what are the timings: why did you get 3 numbers? Is this 3 different launches? The timings are dependent of the processor and the memory hierarchy (eg. speed of the RAM and the possible use of swapped memory). Smaller data structure that fit in cache are faster. The use of swap can make the program run much slower. AFAIK, Numba does not provide anything to control that. – Jérôme Richard Feb 17 '22 at 18:38
  • The first time I ran the big dataset, Scipy used more than 12 GiB of RAM and I just killed it (it was very slow). Subsequent run worked while using a reasonable amount of RAM (AFAIR <1GiB). I think it was a bug of Scipy but I did not succeed to reproduce it on my machine yet. My main computing machine has 16 GiB of RAM. Regarding your input, the output can be quite big. This is very dependent of the amount of neighbour per sphere. If this number is big, then the memory usage will be quite big. – Jérôme Richard Feb 17 '22 at 18:44
  • I meant: 1. *optimized algorithm* 2. *numba with workers -1*, and 3. *numba with workers 1*, so, for *45, 22, and 33 sec*, 45 refers to *optimized algorithm*, 22 refers to *numba with workers -1*, and 33 refers to *numba with workers 1*. My rams are cl7 or cl9, dual DDR3, ripjaws (fast timing). I didn't figure out why I get near the same results on both colab and my machine which have a fairly large difference comparing to yours. Scipy uses rams highly, and may be it is better to use numpy instead. The optimized algorithm used maximum of 40% of my rams. – Ali_Sh Feb 17 '22 at 19:06
  • This is quite weird, On my machine `query_ball_point` scale well and the results are 6.32, 1.44 and 3.28 on the big dataset you provided. I use Scipy 1.6.3 and Numba 0.54 as well as Numpy 1.20.3. The building of the k-d tree takes 173 ms and `query_ball_point` takes 462 ms (with `workers=-1`. The whole process does not take more than 300 MiB. Thus, I guess there is a problem with Scipy on your machine. Can you check the result of the k-d tree building and that the memory issue comes from this? Do you have a recent version of Scipy? Btw I have 2 channels DDR4 @ 3200MHz reaching 36~40 GiB/s. – Jérôme Richard Feb 17 '22 at 23:18
  • ram consumption increasing continuously during the execution. I have tested by various scipy (-1.6 on colab, 1.7.2 and 1.8) and numba (0.53 and 0.55) versions. I have opened a [scipy issue](https://github.com/scipy/scipy/issues/15621) and would be appreciated if you would accompany to solve the problem. – Ali_Sh Feb 18 '22 at 13:57
5

Based on the previous answers, I designed an efficient algorithm with a much lower memory footprint that is much faster than the previous ones (especially on the large dataset). That being said, this algorithm is far more complex and pushes the limits of Python and Numba.

The key issue of the previous algorithms is that they use a dia_max threshold which is much bigger than actually required. Indeed, the search radius is set to the maximum possible diameter (dia_max) so as to be sure not to miss any overlap. The thing is that the big dataset contains balls of very different sizes, and some of them are huge. This means the previous algorithms searched a very large radius around many small balls. The result was thousands of neighbours to check per ball, while only a few can truly overlap.

One solution to address this efficiently is to split the balls into different groups based on their size. The idea is to first sort the balls by radius, split the sorted balls into two groups, independently query neighbours for each possible pair of groups, and then merge the data so as to apply the previous algorithm (with some additional optimizations). More specifically, queries are performed for small balls against big ones, small balls against other small ones, big balls against other big ones, and big balls against small ones.

Another key point to speed this up is to run the different neighbour queries in parallel using joblib. This solution is far from perfect, since the BallTree object needs to be duplicated, which is inefficient, but this is mandatory because of the way parallelism is currently done in CPython (i.e. GIL, pickling, etc.). Using a package that supports parallel requests could bypass this inherent limitation of CPython, but the existing packages doing that do not seem to provide an interface useful enough to address this problem, or are not optimized enough to actually be useful.

Finally, the Numba code can be strongly optimized by removing almost all of the very expensive (implicit) array allocations. Using an in-place sorting algorithm optimized for small arrays also significantly improves the execution time (mainly because Numba's default implementation performs several expensive allocations and is not optimized for small arrays). In addition, the final np.unique operation can be completely rewritten with a basic loop, as the main loop iterates over balls with increasing IDs (hence already sorted).

Here is the resulting code:

import numpy as np
import numba as nb
from sklearn.neighbors import BallTree
from joblib import Parallel, delayed

def flatten_neighbours(arr):
    sizes = np.fromiter(map(len, arr), count=len(arr), dtype=np.int64)
    values = np.concatenate(arr, dtype=np.int64)
    return sizes, values

@delayed
def find_neighbours(searched_pts, ref_pts, max_dist):
    balltree = BallTree(ref_pts, leaf_size=16, metric='euclidean')
    res = balltree.query_radius(searched_pts, r=max_dist)
    return flatten_neighbours(res)

def vstack_neighbours(top_infos, bottom_infos):
    top_sizes, top_values = top_infos
    bottom_sizes, bottom_values = bottom_infos
    return np.concatenate([top_sizes, bottom_sizes]), np.concatenate([top_values, bottom_values])

@nb.njit('(Tuple([int64[::1],int64[::1]]), Tuple([int64[::1],int64[::1]]), int64)')
def hstack_neighbours(left_infos, right_infos, offset):
    left_sizes, left_values = left_infos
    right_sizes, right_values = right_infos
    n = left_sizes.size
    out_sizes = np.empty(n, dtype=np.int64)
    out_values = np.empty(left_values.size + right_values.size, dtype=np.int64)
    left_cur, right_cur, out_cur = 0, 0, 0
    right_values += offset
    for i in range(n):
        left, right = left_sizes[i], right_sizes[i]
        full = left + right
        out_values[out_cur:out_cur+left] = left_values[left_cur:left_cur+left]
        out_values[out_cur+left:out_cur+full] = right_values[right_cur:right_cur+right]
        out_sizes[i] = full
        left_cur += left
        right_cur += right
        out_cur += full
    return out_sizes, out_values

@nb.njit('(int64[::1], int64[::1], int64[::1], int64[::1])')
def reorder_neighbours(in_sizes, in_values, index, reverse_index):
    n = reverse_index.size
    out_sizes = np.empty_like(in_sizes)
    out_values = np.empty_like(in_values)
    in_offsets = np.empty_like(in_sizes)
    s, cur = 0, 0

    for i in range(n):
        in_offsets[i] = s
        s += in_sizes[i]

    for i in range(n):
        in_ind = reverse_index[i]
        size = in_sizes[in_ind]
        in_offset = in_offsets[in_ind]
        out_sizes[i] = size
        for j in range(size):
            out_values[cur+j] = index[in_values[in_offset+j]]
        cur += size

    return out_sizes, out_values

@nb.njit
def small_inplace_sort(arr):
    if len(arr) < 80:
        # Basic insertion sort
        i = 1
        while i < len(arr):
            x = arr[i]
            j = i - 1
            while j >= 0 and arr[j] > x:
                arr[j+1] = arr[j]
                j = j - 1
            arr[j+1] = x
            i += 1
    else:
        arr.sort()

@nb.jit('(float64[:, ::1], float64[::1], int64[::1], int64[::1])')
def compute(poss, radii, neighbours_sizes, neighbours_values):
    n, m = neighbours_sizes.size, np.max(neighbours_sizes)

    # Big buffers allocated with the maximum size.
    # Thanks to virtual memory, it does not take more memory than actually needed.
    particle_corsp_overlaps = np.empty(neighbours_values.size, dtype=np.float64)
    ends_ind_org = np.empty((neighbours_values.size, 2), dtype=np.float64)

    in_offset = 0
    out_offset = 0

    buff1 = np.empty(m, dtype=np.int64)
    buff2 = np.empty(m, dtype=np.float64)
    buff3 = np.empty(m, dtype=np.float64)

    for particle_idx in range(n):
        size = neighbours_sizes[particle_idx]
        cur = 0

        for i in range(size):
            value = neighbours_values[in_offset+i]
            if value != particle_idx:
                buff1[cur] = value
                cur += 1

        nears_i_ind = buff1[0:cur]
        small_inplace_sort(nears_i_ind)  # Note: bottleneck of this function
        in_offset += size

        if len(nears_i_ind) == 0:
            continue

        x1, y1, z1 = poss[particle_idx]
        cur = 0

        for i in range(len(nears_i_ind)):
            index = nears_i_ind[i]
            x2, y2, z2 = poss[index]
            dist = np.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2 + (z2 - z1) ** 2)
            contact_check = dist - (radii[index] + radii[particle_idx])
            if contact_check <= 0.0:
                buff2[cur] = contact_check
                buff3[cur] = index
                cur += 1

        particle_corsp_overlaps[out_offset:out_offset+cur] = buff2[0:cur]

        contacts_sec_ind = buff3[0:cur]
        small_inplace_sort(contacts_sec_ind)
        sphere_olps_ind = contacts_sec_ind

        for i in range(cur):
            ends_ind_org[out_offset+i, 0] = particle_idx
            ends_ind_org[out_offset+i, 1] = sphere_olps_ind[i]

        out_offset += cur

    # Truncate the views to their real size
    particle_corsp_overlaps = particle_corsp_overlaps[:out_offset]
    ends_ind_org = ends_ind_org[:out_offset]

    assert len(ends_ind_org) % 2 == 0
    size = len(ends_ind_org)//2
    ends_ind = np.empty((size,2), dtype=np.int64)
    ends_ind_idx = np.empty(size, dtype=np.int64)
    gap = np.empty(size, dtype=np.float64)
    cur = 0

    # Keep each contact pair only once (replaces np.unique + np.sort)
    for i in range(len(ends_ind_org)):
        left, right = ends_ind_org[i]
        if left < right:
            ends_ind[cur, 0] = left
            ends_ind[cur, 1] = right
            ends_ind_idx[cur] = i
            gap[cur] = particle_corsp_overlaps[i]
            cur += 1

    return gap, ends_ind, ends_ind_idx, ends_ind_org

def ends_gap(poss, radii):
    assert poss.size >= 1

    # Sort the balls
    index = np.argsort(radii)
    reverse_index = np.empty(index.size, np.int64)
    reverse_index[index] = np.arange(index.size, dtype=np.int64)
    sorted_poss = poss[index]
    sorted_radii = radii[index]

    # Split them in two groups: the small and the big ones
    split_ind = len(radii) * 3 // 4
    small_poss, big_poss = np.split(sorted_poss, [split_ind])
    small_radii, big_radii = np.split(sorted_radii, [split_ind])
    max_small_radii = sorted_radii[max(split_ind, 0)]
    max_big_radii = sorted_radii[-1]

    # Find the neighbours in parallel
    result = Parallel(n_jobs=4, backend='threading')([
        find_neighbours(small_poss, small_poss, small_radii+max_small_radii),
        find_neighbours(small_poss, big_poss,   small_radii+max_big_radii  ),
        find_neighbours(big_poss,   small_poss, big_radii+max_small_radii  ),
        find_neighbours(big_poss,   big_poss,   big_radii+max_big_radii    )
    ])
    small_small_neighbours = result[0]
    small_big_neighbours = result[1]
    big_small_neighbours = result[2]
    big_big_neighbours = result[3]

    # Merge the (segmented) arrays in a big one
    neighbours_sizes, neighbours_values = vstack_neighbours(
        hstack_neighbours(small_small_neighbours, small_big_neighbours, split_ind),
        hstack_neighbours(big_small_neighbours, big_big_neighbours, split_ind)
    )

    # Reverse the indices.
    # Note that the results in `neighbours_values` associated to 
    # `neighbours_sizes[i]` are subsets of `query_radius([poss[i]], r=dia_max)`
    # on a `BallTree(poss)`.
    res = reorder_neighbours(neighbours_sizes, neighbours_values, index, reverse_index)
    neighbours_sizes, neighbours_values = res

    # Finally compute the neighbours with a method similar to the 
    # previous one, but using a much faster optimized code.
    return compute(poss, radii, neighbours_sizes, neighbours_values)

result = ends_gap(poss, radii)

Here is the results (still on the same i5-9600KF machine):

Small dataset:
 - Reference optimized Numba code:    256 ms
 - This highly-optimized Numba code:   82 ms

Big dataset:
 - Reference optimized Numba code:    42.7 s  (take about 7~8 GiB of RAM)
 - This highly-optimized Numba code:   4.2 s  (take about  1  GiB of RAM)

Thus, the new algorithm is about 3.1 times faster on the small dataset (in addition to the previous optimizations), and about 10 times faster on the big dataset! This is three orders of magnitude faster than the initially posted algorithms.

Note that 80% of the time is spent in the BallTree query (which is already mostly parallel). The main Numba computing function takes only 12% of the time, and more than 75% of that is spent sorting the input indices. As a result, the neighbourhood search is clearly the bottleneck. It could be improved a bit by splitting the current queries into multiple smaller ones, but this would make the code even more complex for a relatively small improvement (e.g. 1.5x faster). Note that more complex code is harder to maintain and modifications are bug-prone. Thus, I think moving to a native language to overcome the limitations of Python is the best solution to increase performance. That being said, writing faster native code to solve this problem is far from simple (unless you find a good k-d tree, octree, or ball tree library). Still, it is certainly better than optimizing this code further.


Analysis

A profiling analysis shows that at least 50% of the time in scikit-learn's BallTree is spent in unoptimized scalar loops that could use SIMD instructions like AVX-2 (and loop unrolling) to be about 4 times faster. Additionally, some multi-threading issues are also visible (the 4 threads on the top are the joblib workers, the light-green sections are the idle time):

[profiling screenshot]

This shows that this implementation is sub-optimal. One possible way to easily improve the execution time may be to optimize the hot loops of the scikit-learn BallTree implementation. Another strategy could be to try to use threads more efficiently (possibly by releasing the GIL in some parts of the scikit-learn module).

Since the BallTree class of scikit-learn is written in Cython (BallTree is based on KDTree, itself based on BinaryTree), you can try to rebuild the package on your machine and simply tweak the compiler optimizations. Using the flags -O3 -march=native -ffast-math should enable the compiler to use faster SIMD instructions and more aggressive optimizations, resulting in a significant speed-up. Note that -ffast-math is unsafe, as it assumes the scikit-learn code never uses NaN, Inf, or -0 values (otherwise the result is completely undefined) and that floating-point operations are associative (which can give different results). That being said, such an option is critical to improve the automatic vectorization of numerical code.

Regarding the GIL, one can see that it is released in the query_radius function, but this does not seem to be the case for the BallTree constructor. Maybe the simplest solution is to implement a parallel version of query/query_radius like SciPy did.

Jérôme Richard
  • Thanks for the answer. I have a question: when I copy your code in a python file and call the `ends_gap` function twice repeatedly (in one run), the both consume equal times. I put all of the proposed functions in this issue into a [comprehensive function (line 100-300)](https://colab.research.google.com/drive/1D5c8Dv0DUhWwmwpZhmwkLSlYJe2fIyPx?usp=sharing) to have the chance to select any desired function. Now, when I run this function twice (for your last code) as before, the first function consume nearly one and half to twice as much time than the second call. – Ali_Sh Mar 15 '22 at 06:14
  • It seems, the code did not pre compiled by the signatures at the first call but the second call is compiled (due to the first call). Do you have any idea why this happening and how to solve this issue if possible? – Ali_Sh Mar 15 '22 at 06:14
  • An unrelated question: It takes twice as long when using this comprehensive function (best pre-compiled related time, i.e. the second call resulted duration) comparing to the case that we just put your proposed code into a file (not the comprehensive function). E.g. for a case, first and second call of the comprehensive one takes 9 and 6 seconds correspondingly and using just your code will take 3 seconds. Is this an inevitable waste of time if we want to create a utils file that contains some different comprehensive functions? I would be grateful for any recommendations. – Ali_Sh Mar 15 '22 at 06:14
  • You should not use Numba JIT decorator inside functions, otherwise Numba will have to recompile the function over and over because it cannot know that the code of the function is the same (in fact even if the code is the same, the global variable can reference different values so it needs to re-perform most of the work anyway). It is also better to avoid having big lambda/closure in the code. That being said, you can use @generated_jit if you really want to do that with Numba. The cost you see is the compilation time. – Jérôme Richard Mar 15 '22 at 19:17
  • I did not understand the "unrelated question". What do you means by "comprehensive functions"? – Jérôme Richard Mar 15 '22 at 19:28
2

By fixing the query radius at twice the max sphere radius, you're creating a lot of spurious "collisions" to filter out.

The Python below achieves a significant speedup relative to your answer by using a fourth dimension to improve the selectivity of the kd-tree queries. Each Euclidean ball of radius r is over-approximated by an L1 ball of radius r√d, where d is the dimension (3 here). The test for L1 balls colliding in 3d then becomes a test for points being within a fixed L1 distance in 4d: the centers are scaled by 1/√d so that the L1 distance between scaled centers never exceeds the original Euclidean distance, and the fourth coordinate (max_radius - radius for tree points, radius - max_radius for query points) contributes 2*max_radius - radii[i] - radii[j] to the L1 distance, so every truly overlapping pair lies within L1 distance 2*max_radius of each other.

If you switched to a lower level language, you could potentially avoid a separate filtering step by altering the kd-tree implementation to use a combination L2+L1 metric.

import numpy as np
from scipy import spatial
from timeit import default_timer


def load_data():
    centers = np.loadtxt("pos_large.csv", delimiter=",")
    radii = np.loadtxt("radii_large.csv")
    assert radii.shape + (3,) == centers.shape
    return centers, radii


def count_contacts(centers, radii):
    scaled_centers = centers / np.sqrt(centers.shape[1])
    max_radius = radii.max()
    tree = spatial.cKDTree(np.c_[scaled_centers, max_radius - radii])
    count = 0
    for i, x in enumerate(np.c_[scaled_centers, radii - max_radius]):
        for j in tree.query_ball_point(x, r=2 * max_radius, p=1):
            d = centers[i] - centers[j]
            r = radii[i] + radii[j]
            if i < j and np.inner(d, d) <= r * r:
                count += 1
    return count


def main():
    centers, radii = load_data()
    start = default_timer()
    print(count_contacts(centers, radii))
    end = default_timer()
    print(end - start)


if __name__ == "__main__":
    main()
David Eisenstat
  • Thanks for pointing to a very good point to remove excess nearest points, that will help in memory consumption reduction and may leads to faster code, consequently. Your written code is not complete and I don't know what goal you want to show by evaluating time or by which section of the codes in my answer you want to compare? The count of the contacts differ by my written answer (using `len(ends_ind)`), if `count ` showing twice the contacts number. `p=1` seems to be faster than `p=2` *by the same results*? Please compare the resulted runtimes by my answer's codes; In my tests it was slower. – Ali_Sh Mar 07 '22 at 22:50
  • @Ali_Sh Change `i < j` to `i != j` then. This code was much faster on my laptop than your "Numba final optimized code". I'm assuming that you measured after changing `p=1` to `p=2`, which defeats the optimization here. We still have perfect recall with `p=1` because the kd-tree points are scaled. – David Eisenstat Mar 08 '22 at 00:12
  • `i < j` was correct; I made a mistake that weird the radii. `p=1` is well. I have compared the speed of your code by the complete Numba code in [colab](https://colab.research.google.com/drive/1mXZQmxlsoVVJbA9TCWSUsEtBPMl37ZLN?usp=sharing) and Numba was faster. Does it need any other changes or any accelerator? I have modified your code just to import `poss` and `radii` to the `load_data` function from the outside of the function. – Ali_Sh Mar 08 '22 at 04:57
1

As an update to Richard's answer, and to overcome probable memory leaks, I post this answer. During my test runs, memory usage grows and limits execution to smaller data volumes (at most 200,000 on my machine and 100,000 on Colab). This problem leads to much longer runtimes than those reported by Richard. So I opened a SciPy issue about these performance differences and posted and compared some memory results there.
However, I have not received any answer so far, and the origin of these significant performance differences is still not clear to me.

Fezzani referred to another SciPy issue suggesting the use of chunks, and prepared a comparison showing the influence of the chunk size on the runtimes. Strangely, although Fezzani's machine (Intel® Core™ i7-10700K CPU @ 3.80GHz × 16; 32 GiB of RAM) seems more powerful than Richard's machine (6-core i5-9600KF, 16 GiB of RAM, 2 channels of DDR4 @ 3200MHz reaching 36~40 GiB/s), his execution on the large data takes at least (around) 33 seconds with the chunk method (used to avoid memory leaks).
I could not figure out why, nor which hardware lets a machine avoid the memory problem and achieve fast execution as in Richard's case (perhaps it is related to the KF type of Richard's CPU).
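
For reference, here is a minimal sketch of what such a chunked query might look like (the helper name and chunk size are mine, not from the linked issue); it queries the tree for one slice of points at a time so the intermediate neighbour lists never hold the whole result at once:

import numpy as np
from scipy.spatial import cKDTree

def query_in_chunks(poss, dia_max, chunk_size=10000):
    kdtree = cKDTree(poss)
    sizes, values = [], []
    for start in range(0, len(poss), chunk_size):
        chunk = kdtree.query_ball_point(poss[start:start + chunk_size], r=dia_max, workers=-1)
        sizes.append(np.fromiter(map(len, chunk), dtype=np.int64, count=len(chunk)))
        values.append(np.concatenate([np.asarray(c, dtype=np.int64) for c in chunk]))
    # Same flattened layout (per-point sizes + concatenated indices) as used by the compute function.
    return np.concatenate(sizes), np.concatenate(values)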


Searching through some related memory issues, I could guess that cKDTree methods face this inevitable problem when the data volume is huge or …, and that scikit-learn might be a better choice. In this regard, based on my understanding of JaminSore's answer and the referenced Martelli answer, I evaluated BallTree and KDTree from scikit-learn. BallTree performs better than KDTree in my cases (about 1.5 to 2 times faster), so I used it. There were no memory leaks for the large data, but it took 2 minutes (Richard's results and mine now differ only in time units ;)). It ran faster than SciPy as the data volume increased. In my tests, SciPy was faster on smaller data volumes (with low memory consumption), but as the data volume grows, SciPy's performance falls behind due to its implementation behaviour or related bugs (still unclear to me); for my prepared 100,000-sphere data, scikit-learn performs 1.5 to 2 times faster.

I guess that using arrays is the big advantage of scikit-learn compared to the SciPy method's lists, which can be inferred from the aforementioned Martelli answer. That may be the reason for the performance difference.

The scikit-learn methods return an object-dtype ndarray containing arrays of different lengths, which need to be sorted to get the same results as the main code. I applied the required sorting to each element in the loop of the compute function by changing the nears_i_ind line to nears_i_ind = np.sort(all_neighbours[an_offset:an_offset+cur_len]). Using BallTree, tmp and all_neighbours consume nearly the same amount of memory. Note: if both use the same name, memory consumption is reduced (almost halved). So, Richard's ends_gap function modified to use BallTree becomes:

def ends_gap(poss, dia_max):
    balltree = BallTree(poss, metric='euclidean')

    # tmp = balltree.query_radius(poss, r=dia_max)
    # all_neighbours = np.concatenate(tmp, dtype=np.int64)
    all_neighbours = balltree.query_radius(poss, r=dia_max)
    all_neighbours_sizes = np.array([len(e) for e in all_neighbours], dtype=np.int64)
    all_neighbours = np.concatenate(all_neighbours, dtype=np.int64)

    particle_corsp_overlaps, ends_ind_lst = compute(poss, all_neighbours, all_neighbours_sizes)
    ends_ind_org = np.concatenate(ends_ind_lst)
    ends_ind, ends_ind_idx = np.unique(np.sort(ends_ind_org), axis=0, return_index=True)
    gap = np.concatenate(particle_corsp_overlaps)[ends_ind_idx]
    return gap, ends_ind, ends_ind_idx, ends_ind_org

It is not multi-threaded, so multi-threading could improve the speed; I will try to multi-thread it.
On my machine (1st-gen Intel Core i5 760 @ 2.8 GHz, 16 GB CL9 dual-channel DDR3 RipJaws RAM, 64-bit Windows) for a 200,000-sphere data volume:

[timing results screenshot]


There were some mistakes in my two proposed methods that resulted in different gap values, as mentioned in the Notes section by Richard. To produce the same results, return_sorted=True must be added to the nears_i_ind query in the optimized algorithm, and ends_ind and ends_ind_lst must be changed to lists, along with removing the if-else statements, in both codes:

Optimized algorithm:

def ends_gap(poss, dia_max):
    particle_corsp_overlaps = []
    ends_ind = []                                                                       # <------- this line is modified

    kdtree = cKDTree(poss)

    for particle_idx in range(len(poss)):
        cur_point = poss[particle_idx]
        nears_i_ind = np.array(kdtree.query_ball_point(cur_point, r=dia_max, return_sorted=True), dtype=np.int64)       # <------- this line is modified
        assert len(nears_i_ind) > 0

        if len(nears_i_ind) <= 1:
            continue

        nears_i_ind = nears_i_ind[nears_i_ind != particle_idx]
        dist_i = distance.cdist(poss[nears_i_ind], cur_point[None, :]).squeeze()

        contact_check = dist_i - (radii[nears_i_ind] + radii[particle_idx])
        connected = contact_check[contact_check <= 0]

        particle_corsp_overlaps.append(connected)

        contacts_ind = np.where([contact_check <= 0])[1]
        contacts_sec_ind = nears_i_ind[contacts_ind]
        sphere_olps_ind = np.sort(contacts_sec_ind)

        ends_ind_mod_temp = np.array([np.repeat(particle_idx, len(sphere_olps_ind)), sphere_olps_ind], dtype=np.int64).T
        ends_ind.append(ends_ind_mod_temp)                                              # <------- this line is modified

    ends_ind_org = np.concatenate(ends_ind)
    ends_ind, ends_ind_idx = np.unique(np.sort(ends_ind_org), axis=0, return_index=True)
    gap = np.concatenate(particle_corsp_overlaps)[ends_ind_idx]
    return gap, ends_ind, ends_ind_idx, ends_ind_org

Numba final optimized code:

@nb.jit('(float64[:, ::1], int64[::1], int64[::1])')
def compute(poss, all_neighbours, all_neighbours_sizes):
    particle_corsp_overlaps = []
    ends_ind_lst = []                                                                   # <------- this line is modified
    an_offset = 0

    for particle_idx in range(len(poss)):
        cur_len = all_neighbours_sizes[particle_idx]
        nears_i_ind = np.sort(all_neighbours[an_offset:an_offset+cur_len])              # <------- this line is modified
        an_offset += cur_len
        assert len(nears_i_ind) > 0

        if len(nears_i_ind) <= 1:
            continue

        nears_i_ind = nears_i_ind[nears_i_ind != particle_idx]
        dist_i = np.empty(len(nears_i_ind), dtype=np.float64)

        x1, y1, z1 = poss[particle_idx]
        for i in range(len(nears_i_ind)):
            x2, y2, z2 = poss[nears_i_ind[i]]
            dist_i[i] = np.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2 + (z2 - z1) ** 2)

        contact_check = dist_i - (radii[nears_i_ind] + radii[particle_idx])
        connected = contact_check[contact_check <= 0]
        particle_corsp_overlaps.append(connected)

        contacts_ind = np.where(contact_check <= 0)
        contacts_sec_ind = nears_i_ind[contacts_ind]
        sphere_olps_ind = np.sort(contacts_sec_ind)

        ends_ind_mod_temp = np.empty((len(sphere_olps_ind), 2), dtype=np.int64)
        for i in range(len(sphere_olps_ind)):
            ends_ind_mod_temp[i, 0] = particle_idx
            ends_ind_mod_temp[i, 1] = sphere_olps_ind[i]
        ends_ind_lst.append(ends_ind_mod_temp)                                          # <------- this line is modified

    return particle_corsp_overlaps, ends_ind_lst


def ends_gap(poss, dia_max):
    balltree = BallTree(poss, metric='euclidean')                                       # <------- new code
    all_neighbours = balltree.query_radius(poss, r=dia_max)                             # <------- new code and modified
    all_neighbours_sizes = np.array([len(e) for e in all_neighbours], dtype=np.int64)   # <------- this line is modified
    all_neighbours = np.concatenate(all_neighbours, dtype=np.int64)                     # <------- this line is modified
    particle_corsp_overlaps, ends_ind_lst = compute(poss, all_neighbours, all_neighbours_sizes)
    ends_ind_org = np.concatenate(ends_ind_lst)
    ends_ind, ends_ind_idx = np.unique(np.sort(ends_ind_org), axis=0, return_index=True)
    gap = np.concatenate(particle_corsp_overlaps)[ends_ind_idx]
    return gap, ends_ind, ends_ind_idx, ends_ind_org

On my machine, for a data volume of around 550,000:

[timing results screenshot]

Ali_Sh
  • Very interesting. I am still quite surprised by the high memory consumption of `query_radius` although it is a bit better than the one of Scipy. This could be due to the use of multiple threads though (either by design or dynamically). – Jérôme Richard Mar 11 '22 at 17:09
  • @JérômeRichard, There was no difference between `workers=1` and `workers=-1` in terms of memory consumption, if you mean by *by design*. – Ali_Sh Mar 11 '22 at 23:04
  • Ok. This was for the "dynamically" case. For the "by design", I mean that the possible use of multiple threads add constraints on the choice of data structures in the code that can increase the memory consumption. This is quite frequent in parallel computing. People does not often write a specific code for the sequential case when they write a parallel implementation. That being said, I come to think the Scipy implementation is simply not optimized at all... – Jérôme Richard Mar 11 '22 at 23:40
  • I just found out that I could not reproduce the result of the big dataset and that is because I used a wrong `dia_max` value during the benchmark: it was the one of the small dataset and not the one of the big one. This explain the observed difference so far: I confirm that Scipy use too much memory on my machine with a bigger dia_max. Note that dia_max was the same for all version so the comparison was fine, but the input was not the one you want. Thus, I edited my previous answer to remove the big dataset benchmark and added some precision about the memory usage. – Jérôme Richard Mar 12 '22 at 17:34
  • In fact, the high value of `dia_max` in the big dataset is the root of the problem: it causes the result of the k-d/ball trees to be HUGE because of the very large number of neighbours hence a slow computation and a high memory usage. – Jérôme Richard Mar 12 '22 at 17:36
0

Have you tried FLANN?

This code doesn't solve your problem completely. It simply finds the 50 nearest neighbors of each point in your 500,000-point dataset:

import numpy as np
from pyflann import FLANN

p = np.loadtxt("pos_large.csv", delimiter=",")
flann = FLANN()
flann.build_index(pts=p)
idx, dist = flann.nn_index(qpts=p, num_neighbors=50)

The last line takes less than a second on my laptop without any tuning or parallelization.

aerobiomat
  • I did not know *FLANN*. I try it as you mentioned, It is very fast and could be very helpful if get the true wanted results. [In my test](https://colab.research.google.com/drive/1KzhnFyF_fjdf8taHFGWnHL6352niHLR1?usp=sharing), the results for `idx` and `dist` are differ from my needs and I could not figure out what are the results showing? I would appreciate for more explanation and updating your answer by the test example prepared in this comment, *if it is capable to get the needed results.* *Notes: In the test results, just the distances must be compared, not the printed indices.* – Ali_Sh Feb 15 '22 at 09:56
  • @Ali_Sh Be aware that ANN means Approximate Nearest Neighbor search, i.e. it may skip some of the nearest neighbors and not return them. You have to decide whether missing some neighbors is fine in your case. – TilmannZ Feb 16 '22 at 11:30
  • @TilmannZ In my case, unfortunately, I need the exact positions not approximations. From the prepared test's results, All `FLANN`'s results are different from the exact ones. I will investigate it again besides some other libraries that I just met them e.g. `annoy`, `ngt`, and … based on [the testing benchmarks](http://ann-benchmarks.com/) prepared by ***Erik Bernhardsson***. – Ali_Sh Feb 16 '22 at 21:03