Many references state that a KDTree is a fast way to find nearest neighbors in large data sets. My current task is to find, for each point in a query set A, the nearest point in a data set X. Concretely, X contains 1,000,000 numerical data points and A contains 10,000. The result should therefore be 10,000 indices identifying points in X.
When I used cdist (from scipy.spatial.distance) with a for loop to find the nearest point for each element of A, it took about half an hour (1972 seconds point by point, 1926 seconds with a chunked loop), while cKDTree.query took about 47 minutes (2839 seconds) with n_jobs = 4.
The code for the cdist approach is below:
import time
import numpy as np
from scipy.spatial.distance import cdist

t = time.time()
nn = np.array([], dtype=int)
jump = 1000  # process A in chunks of 1000 points to limit memory use
nloop = np.ceil(A.shape[0] / jump).astype(int)
for ii in range(nloop):
    # distances between every point of X and the current chunk of A
    temp = cdist(X, A[ii*jump:(ii+1)*jump])
    # for each column (point of A), the index into X of the nearest point
    nn = np.append(nn, np.argmin(temp, axis=0))
print('Elapsed time: ', time.time() - t)  # 1926 seconds (a little faster than looping one point at a time)
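To show the chunking is doing what I intend, here is a scaled-down, self-contained version on synthetic data (shapes and the dimensionality 3 are placeholders, not my real data); the chunked loop produces the same indices as computing the full distance matrix in one shot:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)
X = rng.random((500, 3))   # stand-in for the 1,000,000-point data set
A = rng.random((95, 3))    # deliberately not a multiple of the chunk size

jump = 10
nloop = np.ceil(A.shape[0] / jump).astype(int)
nn = np.array([], dtype=int)
for ii in range(nloop):
    temp = cdist(X, A[ii*jump:(ii+1)*jump])
    nn = np.append(nn, np.argmin(temp, axis=0))

# chunked result matches a single full-distance-matrix argmin
assert np.array_equal(nn, np.argmin(cdist(X, A), axis=0))
```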
The code for the cKDTree approach is below:

import time
from scipy.spatial import cKDTree

t = time.time()
tree = cKDTree(X)
# query returns (distances, indices); keep only the indices
# (n_jobs was renamed to workers in newer SciPy releases)
nn = tree.query(A, 1, n_jobs=4)[1]
print('Elapsed time: ', time.time() - t)  # 2839 seconds
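In case it matters for the timings: I have not tuned the tree construction at all. cKDTree also accepts build parameters (leafsize, balanced_tree, compact_nodes) that trade build time against query time; a minimal sketch on synthetic data (sizes and values illustrative, not tuned):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X = rng.random((5000, 3))  # scaled-down stand-in for my real X
A = rng.random((500, 3))

# larger leafsize means fewer, bigger leaves: cheaper build, more brute force per leaf
tree = cKDTree(X, leafsize=32, balanced_tree=True, compact_nodes=True)
dist, idx = tree.query(A, k=1)
print(idx.shape)  # one index into X per query point -> (500,)
```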
- I'm curious whether these timings are normal, and
- under what circumstances cKDTree should be preferred, if cdist is actually faster even though both ultimately find nearest neighbors by computing distances. Would a KDTree pull ahead if the query set A were much larger?
- Is there a way to get only the indices when querying for the nearest point (k=1), without computing the distances? My guess is that the distance calculation slows things down considerably (though at this point it is just a guess).
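For anyone who wants to reproduce this, here is a self-contained, scaled-down version of both approaches on synthetic data (sizes reduced so it runs in seconds; my real X and A are far larger, so the relative timings may differ):

```python
import time
import numpy as np
from scipy.spatial import cKDTree
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X = rng.random((20000, 3))  # stand-in for the 1,000,000-point data set
A = rng.random((2000, 3))   # stand-in for the 10,000 query points

# approach 1: chunked brute force with cdist
t = time.time()
jump = 1000
nn_cdist = np.concatenate([
    np.argmin(cdist(X, A[i:i + jump]), axis=0)
    for i in range(0, A.shape[0], jump)
])
t_cdist = time.time() - t

# approach 2: cKDTree (default settings, single worker)
t = time.time()
tree = cKDTree(X)
nn_tree = tree.query(A, 1)[1]
t_tree = time.time() - t

# both approaches return the same nearest-neighbor indices
assert np.array_equal(nn_cdist, nn_tree)
print(f'cdist: {t_cdist:.3f}s  cKDTree: {t_tree:.3f}s')
```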