
I have n points and I need to compute the Euclidean distance between each point and the remaining n-1 points. I have used the following approach in Python:

for eachRow in range(0, numberOfPoints):
    distanceProximityMatrix.append([])

print('Initialisation Completed')
for i in range(0, numberOfPoints):
    if i % 100 == 0:
        print('.', end='')
    for j in range(i, numberOfPoints):
        if i != j:
            tempDist = distanceForMultivariate(recordsList[i], recordsList[j],
                                               attributesToBeUsed,
                                               isFirstColumnID=isFirstColumnID)
            distanceProximityMatrix[i].append(tempDist)
            distanceProximityMatrix[j].append(tempDist)
        else:
            distanceProximityMatrix[i].append(0)

Is there a faster way to do this? The number of points I have is quite large, and this strategy takes a long time.

Note: the distanceForMultivariate function calculates the Euclidean distance.
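If the records can be loaded into a NumPy array (restricted to the columns in attributesToBeUsed), the whole matrix can be computed in one vectorized step instead of a Python double loop. A sketch, using random data just for illustration:

```python
import numpy as np

# Hypothetical data: n points with d coordinates each.
points = np.random.rand(500, 3)

# Broadcasting computes all pairwise differences at once:
# shape (n, 1, d) minus shape (1, n, d) gives shape (n, n, d).
diff = points[:, None, :] - points[None, :, :]

# Sum the squared differences over the coordinate axis, then sqrt.
distance_matrix = np.sqrt((diff ** 2).sum(axis=-1))
```

Note that the `diff` temporary takes O(n^2 * d) memory; `scipy.spatial.distance.pdist` computes only the n*(n-1)/2 unique distances and avoids that overhead.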

Sushodhan

2 Answers


I am assuming 2D points here. The Euclidean distance is then:

sqrt( (x1 - x2)^2 + (y1 - y2)^2 )

We have the following operations here:

  • 2 subtractions
  • 2 multiplications
  • 1 addition
  • 1 sqrt

If you only need to compare distances (for example, to find the closest neighbours), you can drop the sqrt entirely, because squaring preserves the ordering. Be careful that the squared values don't get too large, though; if you want to sum them later, they can become quite big.
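For instance, a small sketch (the points are made up) showing that the squared distance picks the same nearest neighbour as the true distance:

```python
import math

source = (2, 2)
candidates = [(2, 6), (3, 2), (5, 6), (2, 3)]

def dist_sq(p, q):
    # Squared Euclidean distance: no sqrt, but the ordering is the same.
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

nearest_squared = min(candidates, key=lambda p: dist_sq(source, p))
nearest_true = min(candidates, key=lambda p: math.dist(source, p))
```

Both computations select the same point, while the squared version skips one sqrt per comparison.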

The triangle inequality does NOT hold for squared distances, so don't use them where it is needed (so no pathfinding, or basically anywhere you would sum distances). From sqrt(a) + sqrt(b) >= sqrt(c) you cannot conclude a + b >= c, because

a + b <= a + 2*sqrt(a*b) + b = (sqrt(a) + sqrt(b))^2

For example: sqrt(100) + sqrt(1) >= sqrt(121), but 100 + 1 < 121.

That being said, I don't think you can reduce the complexity if you really need all the distances: no matter what, you are computing O(n^2) values.

[Update since the application is now clear]

While I think my suggestion works for finding the nearest neighbours, there are actually better algorithms for that problem than computing the distance for every pair of points. For example, k-d trees.

The answers to this question may help: How to efficiently find k-nearest neighbours in high-dimensional data?
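As a sketch of what the k-d tree route looks like in practice (using scipy.spatial.cKDTree on random data, just for illustration):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.random((1000, 2))

# Build the tree once; queries are then O(log n) on average per point.
tree = cKDTree(points)

# The 5 nearest neighbours of every point (k=6 because each point
# is its own nearest neighbour at distance 0).
distances, indices = tree.query(points, k=6)
```

This avoids materialising the full n x n distance matrix entirely, which is what makes it scale to large n.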

kutschkem

If the intention is just to find the closest k points, what do you think of this one?
Start by putting the first k points into some sorted array (sorted by distance to your source point), and calculate the maximum distance; call this d_max.
For every new point p, do the following check:

if abs(x_p - x_start) > d_max or abs(y_p - y_start) > d_max:
    disregard p
else:
    d = distance(p, start)
    if d < d_max:
        insert_into_array(p)  # obviously the array must stay sorted
        d_max = distance(array[k], start)

The idea behind it is the following: if the absolute difference between the X-coordinates or the Y-coordinates is already larger than the maximal distance, then the full distance must be larger too.

E.g.
Imagine your starting point is (2,2), and you have already added (2,6), (2,3) and (3,2), then d_max will be 4. Your other points are (10,0), (0,20) and (5,6), then the following will happen:

Add (10,0)? No, because 10 - 2 > 4 (x_p - x_start > d_max)
Add (0,20)? No, because 20 - 2 > 4 (y_p - y_start > d_max)
Add (5,6) ? Maybe: 5 - 2 <= d_max (X-coordinates) => ok
                   6 - 2 <= d_max (Y-coordinates) => ok
                   distance((5,6),(2,2)) = 5, which is larger than 4 => don't add (5,6)

Obviously, you need to create some kind of "array":

  • where you can insert a point somewhere in the middle, so that the others shift accordingly (a linked list, for example);
  • where, in case you insert a point and you already have k entries, the last entry is removed.

As you only need to compare the distances, there's no need to calculate the square root.
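A minimal Python sketch of this idea (the name `k_nearest_pruned` is mine, and I keep the sorted "array" as a plain list maintained with `bisect` rather than a linked list):

```python
import bisect
import math

def k_nearest_pruned(source, points, k):
    """Return the k points closest to source, using the per-coordinate
    check to skip the full distance computation where possible."""
    sx, sy = source
    # Seed the sorted list with the first k points as (distance, point) pairs.
    best = sorted((math.dist(source, p), p) for p in points[:k])
    d_max = best[-1][0]
    for p in points[k:]:
        # If either coordinate alone differs by more than d_max,
        # the Euclidean distance must exceed d_max: skip the point.
        if abs(p[0] - sx) > d_max or abs(p[1] - sy) > d_max:
            continue
        d = math.dist(source, p)
        if d < d_max:
            bisect.insort(best, (d, p))  # keep the list sorted
            best.pop()                   # drop the now (k+1)-th entry
            d_max = best[-1][0]
    return [p for _, p in best]
```

Running it on the example above, `k_nearest_pruned((2, 2), [(2, 6), (2, 3), (3, 2), (10, 0), (0, 20), (5, 6)], 3)` keeps exactly (2, 3), (3, 2) and (2, 6), and the last three candidates are rejected without ever entering the array.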

Dominique