1

I have a (rather large) set of gps points with latitude and longitude, plus a name; for example:

A    30.22    20.45
B    31.00    20.45
...

Now I need to build a matrix that will tell me if A is "close" (<1km) to B. But I don't need to calculate every single pair, because if A is close to B, then B is close to A. What would be the best way to build this matrix (or half of it really) without performing all calculations?

Dervin Thunk
  • 19,515
  • 28
  • 127
  • 217
  • Is point B "expected" (in any sense) to be closer to A than, say, point T? In other words, is there any kind of order to the list at the moment? If not, building half the matrix might save half the work but the problem will still be O(n^2). – FiddleStix Mar 06 '20 at 11:37
  • @FiddleStix: No, they're not ordered in any way. – Dervin Thunk Mar 06 '20 at 11:49
  • 1
    The problem can be solved in O(nlogn), and that's the optimal solution, you can't do better than that – Riccardo Bucco Mar 06 '20 at 11:51
  • @DervinThunk I updated my answer, trying to convince you that you can't improve the brute force solution too much – Riccardo Bucco Mar 06 '20 at 13:42

1 Answers1

1

The problem is very similar to the well known closest pair of points problem. You can find some solutions here. In that case, the optimal solution can be found in O(nlogn) time. However, I think this is not the case.

For example, all your points might be inside the same circle of radius < 0.5, i.e. all the pairs of points are "close" (distance < 1km). In that case you need to at least generate all of them, an that has the same complexity of finding all the combinations of different points.

You could try with a brute force approach that check all the combinations (in this way you either check the pair (A,B) or (B,A)):

from itertools import combinations

def dist(a, b):
    return sqrt((a[1] - b[1])^2 + (a[2] - b[2])^2)

def closer_than_epsilon(points_list, epsilon):
    return [(p1, p2)
            for p1, p2 in combinations(points_list, r=2)
            if dist(p1, p2) < epsilon]

# df is your pd.DataFrame with three columns: name, x_coor, y_coor
result = closer_than_epsilon(df.values.tolist(), 1)
Riccardo Bucco
  • 13,980
  • 4
  • 22
  • 50
  • 1
    Are you sure this is about the closest pair problem though? The question seems to be about checking _which_ pairs are within some threshold of closeness. You have shown how to calculate the closest two points. I don't think they are the same (I'm not saying that using your answer isn't part of the solution but I don't think it answers the question directly). – FiddleStix Mar 06 '20 at 12:08
  • @FiddleStix You're absolutely right, I'll work on that :) Thanks for pointing it out – Riccardo Bucco Mar 06 '20 at 12:49