
Many references say that a KDTree is a fast way to find nearest neighbors in large data sets. My current task is to find the nearest point in X for each point in a query set A. To elaborate: X currently has 1,000,000 numerical data points and A has 10,000. I want the nearest point in X for each point in A, so the result should be 10,000 indices into X.

When I used cdist (from scipy.spatial.distance) with a for loop to find the nearest point for each chunk of A, it took about half an hour (1972 seconds), while cKDTree.query took about 50 minutes (2839 seconds) with n_jobs = 4.

The code for cdist is as follows:

import time
import numpy as np
from scipy.spatial.distance import cdist

t = time.time()
nn = np.array([], dtype=int)
jump = 1000
nloop = np.ceil(A.shape[0] / jump).astype(int)
for ii in range(nloop):
    # rows of temp index into X, columns into the current chunk of A
    temp = cdist(X, A[ii*jump:(ii+1)*jump])
    nn = np.append(nn, np.argmin(temp, axis=0))
print('Elapsed time: ', time.time() - t)  # 1926 seconds (a little faster than a point-by-point loop)
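For reference, here is a minimal self-contained sketch of the same chunked-cdist approach, with small synthetic arrays standing in for X and A; preallocating the index array avoids the repeated copies that `np.append` makes, and the final check confirms the chunked result matches a single full distance matrix:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X = rng.random((1000, 3))   # stand-in for the 1,000,000-point set
A = rng.random((100, 3))    # stand-in for the 10,000-point query set

jump = 32
nn = np.empty(A.shape[0], dtype=np.intp)  # preallocated instead of np.append
for ii in range(0, A.shape[0], jump):
    chunk = A[ii:ii + jump]
    # temp has shape (len(X), len(chunk)); argmin over axis 0 indexes into X
    temp = cdist(X, chunk)
    nn[ii:ii + chunk.shape[0]] = np.argmin(temp, axis=0)

# sanity check against one full distance matrix
assert np.array_equal(nn, np.argmin(cdist(X, A), axis=0))
```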

The code for cKDTree is as follows:

import time
from scipy.spatial import cKDTree

t = time.time()
tree = cKDTree(X)
nn = tree.query(A, 1, n_jobs=4)[1]  # [1] keeps only the indices, not the distances
print('Elapsed time: ', time.time() - t)  # 2839 seconds
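A small runnable sketch of the same query on synthetic data (note that in SciPy 1.6+ the `n_jobs` keyword of `cKDTree.query` was renamed to `workers`; the check at the end confirms the tree's answer matches a brute-force argmin):

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)
X = rng.random((1000, 3))   # stand-in for the reference set
A = rng.random((100, 3))    # stand-in for the query set

tree = cKDTree(X)
# query returns (distances, indices); indices point into X
# workers=-1 uses all cores (this keyword was n_jobs in older SciPy)
dist, idx = tree.query(A, k=1, workers=-1)

assert np.array_equal(idx, np.argmin(cdist(X, A), axis=0))
```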
  1. Is this normal?
  2. Under what circumstances should cKDTree be used, if cdist is actually faster at finding nearest neighbors, given that both compute distances anyway? Would KDTree come out ahead if I used a much larger data set A?
  3. Is there a way to get only the indices when querying for the nearest point (k=1), without calculating the distances? My guess is that the distance calculation slows it down by a lot (of course, it's just a guess at this point).
Cody Chung
