Closest pair in n dimension

Question

Is there any efficient library function in python to find a pair of points which are closest/farthest? The input is a list of points which lie in a k-dimensional cube of edge size 1. For finding the closest, the following code is taking too much time. TC: O( n**2 * k ) where n is the size of input. In my case, input n is around 4000 and value of k is around 300.

def euclid_dist( p1, p2 ):
    val = 0
    for i in range(len(p1)):
        val += (p1[i]-p2[i])**2
    return val

def find_close_cluster( points ):
    ans1 = 0
    ans2 = 1
    min_dist = 1000000000000
    for i in range( len(clusters) ):
        for j in range( i+1,len(clusters)):
            current_dist = euclid_dist(clusters[i],clusters[j])
            if( current_dist < min_dist ):
                ans1 = i
                ans2 = j
                min_dist = current_dist
    return ( ans1, ans2 )

For minimum distance, this might be of interest to you: http://stackoverflow.com/questions/3700983/what-is-the-fastest-algorithm-to-calculate-the-minimum-distance-between-two-sets — le_m, Apr 11 '17 at 17:33
Are you testing each pair of points twice? For example for points p1 and p2, you are measuring from p1 to p2 and also from p2 to p1. Once a point is checked against all point it should be removed. — CodeCupboard, Apr 11 '17 at 17:36
@churchwalk, no. The 2nd for loop starts with j = i+1. So distance between two points are not calculated more than once. — Abhishek, Apr 11 '17 at 17:44
First - Euler was a great mathematician but the distance function is named after Euclid. Second - because of the curse of dimensionality the distance is probably useless unless your data has some special characteristics. Usually a kd-tree would be the way to go, but with k=300 this will not work out. brute force is probably the best approach for n=4000. — stefan, Apr 12 '17 at 00:21

score 2 · Accepted Answer · answered Apr 11 '17 at 18:09

You should use numpy ndarrays and the scipy.spatial.distance.cdist function. numpy gives you a efficient container to manipulate data in a vectorized form, so the code runs much faster than doing for loops on iterators or lists. The scipy.spatial.distance.cdist uses numpy arrays to compute the distances between all elements. Take a look on the documentation for more details. The code below should work:

import numpy as np
from scipy.spatial.distance import cdist

your_data = np.asarray([[first_sample], [second_sample], [...])
d = cdist(your_data, your_data)
number_samples = your_data.shape[0]
# Select d(a_i,a_j), i != j without repetitions
il1 = np.tril_indices(number_samples, k=-1) 
dist = d[il1]
arg_min = dist.argmin()
min = dist.min()
arg_max = dist.argmax()
max = dist.max()

how does this compare to DnC method? – Raksha Apr 17 '18 at 16:00 — Raksha, Apr 17 '18 at 16:00

Closest pair in n dimension

1 Answers1