
Input:

point = (lat, long)
places = [(lat1, long1), (lat2, long2), ..., (latN, longN)]
count = L

Output: neighbors = subset of places close to the point. (len(neighbors)=L)

Question: Can I use a kd-tree for quick nearest-neighbor lookup for points given as latitude and longitude? (For example, the implementation in scipy.)

Is it necessary to transform the geographical coordinates (latitude and longitude) of the points into x, y coordinates?

Is this the best way to solve the problem?

Andrei

3 Answers


I honestly don't know if using a kd-tree would work correctly, but my hunch says it would be inaccurate.

I think you need to use something like the great-circle distance to get accurate distances.


from math import radians, cos, sin, asin, sqrt, degrees, atan2

def validate_point(p):
    lat, lon = p
    assert -90 <= lat <= 90, "bad latitude"
    assert -180 <= lon <= 180, "bad longitude"

# original formula from  http://www.movable-type.co.uk/scripts/latlong.html
def distance_haversine(p1, p2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    Haversine formula:
        a = sin²(Δφ/2) + cos φ1 ⋅ cos φ2 ⋅ sin²(Δλ/2)
        c = 2 ⋅ atan2( √a, √(1−a) )
        d = R ⋅ c

    where   φ is latitude, λ is longitude, R is earth’s radius (mean radius = 6,371km);
            note that angles need to be in radians to pass to trig functions!
    """
    lat1, lon1 = p1
    lat2, lon2 = p2
    for p in [p1, p2]:
        validate_point(p)

    R = 6371 # km - earth's radius

    # convert decimal degrees to radians 
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])

    # haversine formula 
    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) # 2 * atan2(sqrt(a), sqrt(1-a))
    d = R * c
    return d
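
To answer the lookup part of the question with this approach, here is a minimal brute-force sketch (the names `point`, `places`, and `count` follow the question; using `heapq.nsmallest` is just one way to pick the closest candidates, and this scans every point, so it is O(n) per query):

import heapq

def nearest_neighbors(point, places, count):
    # Compute the haversine distance from `point` to every candidate
    # and keep the `count` closest ones.
    return heapq.nsmallest(count, places,
                           key=lambda p: distance_haversine(point, p))

# Example with made-up coordinates (lat, lon in decimal degrees):
point = (52.52, 13.405)                     # Berlin
places = [(48.8566, 2.3522),                # Paris
          (51.5074, -0.1278),               # London
          (40.7128, -74.0060)]              # New York
print(nearest_neighbors(point, places, count=2))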
Marcel Wilson
  • where to find the function `validate_point` used in this code? I assume it checks that latitude and longitude are between -90 and 90? – fmarm Jun 20 '19 at 14:20
  • 2
    This answer is kind of incomplete and does not answer the OP's question. The function `distance_haversine()` calculates the distance in km between two points given in lat/lon, but it does not answer the question how to find the nearest neighbors using this metric. – lumbric Jul 08 '19 at 17:10
  • @lumbric technically you are correct. I provided the means on how to calculate the distances, as part of the question was asking if geography points need to be converted. Which ultimately the answer is you don't need to convert them if using distance_haversine. You simply find the distance to each set of points and pick the smallest. – Marcel Wilson Jul 08 '19 at 18:23
  • 1
    @MarcelWilson ah yes, you are right, your answer can be easily used to calculate all pairwise distances and then take the minimum. This is possible but not optimal. For large values (let's say >10 000 points) this will use a lot of memory and take a lot of time. It requires O(n^2) time and memory, while the optimal solution should be O(n*log(n)) in time e.g. using some kind of index like k-trees. Since the OP asked about k-trees, I assumed he was interested in an optimal solution. – lumbric Jul 09 '19 at 08:49
  • @lumbric While it would be more efficient, the answers would not always be correct. https://cs.stackexchange.com/a/57619 So when it comes to finding nearest neighbor it's better to limit the number of points to a bounding region around your origin point for exactly the reason you state. – Marcel Wilson Jul 09 '19 at 14:36
  • 1
    @MarcelWilson Ah yes, of course. Trusting on the Euclidean metric is risky if distances are large. I think it should be possible to find algorithms which solve the question in O(n*log(n)) using the Haversine metric. I'm not sure about KDTree, but [BallTree](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.BallTree.html#sklearn.neighbors.BallTree) in sklearn supports the Haversine metric (I'm not sure if there are any pitfalls). – lumbric Jul 09 '19 at 15:35
  • If I recall correctly the risk is more pronounced on the top and bottom of the globe rather than at far distances. I'll have to check out BallTree; sounds magical. – Marcel Wilson Jul 10 '19 at 13:46

scikit-learn provides a BallTree class that supports the Haversine metric. See also this SO question.
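
A minimal sketch of how this could look (the data layout follows the question; note that BallTree's haversine metric expects (lat, lon) in radians and returns distances in radians, so they are scaled by the Earth's radius to get kilometres):

import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_KM = 6371

# (lat, lon) pairs in decimal degrees, converted to radians for the tree
places = np.radians([(48.8566, 2.3522),
                     (51.5074, -0.1278),
                     (40.7128, -74.0060)])
point = np.radians([[52.52, 13.405]])

tree = BallTree(places, metric='haversine')
dist, idx = tree.query(point, k=2)      # the 2 nearest places
print(idx[0])                           # indices into `places`
print(dist[0] * EARTH_RADIUS_KM)        # great-circle distances in km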

Florian Brucker

I think you are trying to solve the k Nearest Neighbor problem.

Since your data set lies in 2D, a kd-tree would work nicely in general; I do not know about scipy specifically.

However, if your points start living in a higher dimension, a kd-tree will not be a smart choice.
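
For the scipy part of the question, a minimal sketch using `scipy.spatial.KDTree` (this queries the raw lat/lon pairs as if they were planar Euclidean coordinates, which is only a rough approximation and can be distorted near the poles or across the 180° meridian, as the comment below points out):

import numpy as np
from scipy.spatial import KDTree

# (lat, lon) pairs in decimal degrees, treated as plain 2D points
places = np.array([(48.8566, 2.3522),
                   (51.5074, -0.1278),
                   (40.7128, -74.0060)])
point = (52.52, 13.405)

tree = KDTree(places)
dist, idx = tree.query(point, k=2)   # 2 nearest neighbors by planar Euclidean distance
print(places[idx])                   # the neighboring (lat, lon) pairs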

gsamaras
  • 3
    Data is not given in 2D, points are in lat/lon which cannot be converted accurately to 2D coordinates. One can calculate the distance between points using the Haversine formula but it is not _identical_ to points in 2D (I don't know if kd-tree is applicable or not). – lumbric Jul 08 '19 at 17:13