
I have a df with Object ID, Latitude and Longitude. I would like to create two new columns: distance to closest point, and Object ID of closest point.

df[['OBJECT_ID','Lat','Long']].head()

          OBJECT_ID        Lat         Long
0  33007002190000.0  47.326963  -103.079835
1  33007007900000.0  47.259770  -103.040797
2  33007008830000.0  47.296953  -103.099424
3  33007012130000.0  47.256700  -103.597082
4  33007013320000.0  46.996013  -103.452384

How can this be done in Python with any library? Also if it helps, my DF contains a few thousand rows.

Odisseo
  • Assuming the points you are evaluating over are the same points in your dataset, you'll have to enumerate all distances for each object in your list of object_ids. Then take the smallest value and store it in something like a map or list. Then you can make a column out of that. Hope this helps – k88 Feb 03 '20 at 07:13
  • See if this post helps your answer - [Getting distance between two points using lat long](https://stackoverflow.com/questions/19412462/getting-distance-between-two-points-based-on-latitude-longitude/) – Pirate X Feb 03 '20 at 07:27
  • Something along these lines but repeated for all columns... pd.DataFrame(distance.cdist(df_dist, df_dist, 'euclidean'))[0].min() – Odisseo Feb 03 '20 at 07:28

1 Answer


You can use SciPy's cKDTree for this. A k-d tree is well suited to nearest-neighbour queries on spatial data.

With your example data, you can do something like

from scipy.spatial import cKDTree

coordinates = df[["Lat", "Long"]].to_numpy()
# build the k-d tree
kdtree = cKDTree(coordinates)
# query the tree with the same coordinates it was built from. NOTICE the k=2
distances, indexes = kdtree.query(coordinates, k=2)

# assign the results to a new dataframe (NOTICE the column index of 1)
# .iloc selects by position, so this also works when df has a non-default index
new_df = df.assign(ClosestID=df["OBJECT_ID"].iloc[indexes[:, 1]].to_numpy())
new_df = new_df.assign(ClosestDist=distances[:, 1])

with the result of

>> new_df

          OBJECT_ID        Lat         Long         ClosestID  ClosestDist
0  33007002190000.0  47.326963  -103.079835  33007008830000.0     0.035838
1  33007007900000.0  47.259770  -103.040797  33007008830000.0     0.069424
2  33007008830000.0  47.296953  -103.099424  33007002190000.0     0.035838
3  33007012130000.0  47.256700  -103.597082  33007013320000.0     0.298153
4  33007013320000.0  46.996013  -103.452384  33007012130000.0     0.298153
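As a sanity check, the result can be compared against a brute-force distance matrix, along the lines of the `cdist` idea from the comments. The following is only a sketch: it rebuilds the five sample rows from the question and masks the diagonal so that a point cannot be matched with itself. The variable names are illustrative.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree
from scipy.spatial.distance import cdist

# The five sample rows from the question.
df = pd.DataFrame({
    "OBJECT_ID": [33007002190000.0, 33007007900000.0, 33007008830000.0,
                  33007012130000.0, 33007013320000.0],
    "Lat": [47.326963, 47.259770, 47.296953, 47.256700, 46.996013],
    "Long": [-103.079835, -103.040797, -103.099424, -103.597082, -103.452384],
})
coords = df[["Lat", "Long"]].to_numpy()

# Brute force: full pairwise matrix, with self-distances on the diagonal masked out.
pairwise = cdist(coords, coords, "euclidean")
np.fill_diagonal(pairwise, np.inf)
brute_idx = pairwise.argmin(axis=1)
brute_dist = pairwise.min(axis=1)

# k-d tree: column 1 holds the nearest neighbour other than the point itself.
tree_dist, tree_idx = cKDTree(coords).query(coords, k=2)

print(np.array_equal(brute_idx, tree_idx[:, 1]))  # True
print(np.allclose(brute_dist, tree_dist[:, 1]))   # True
```

The brute-force matrix needs O(n²) memory, which is still manageable for a few thousand rows. The tree query scales better as the data grows.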

k=2 is needed because, when the tree is queried with the same coordinates it was built from, the closest match will always be the query point itself, i.e.:

>> kdtree.query(coordinates, k=2)

# this is distance
(array([[0.        , 0.03583754],
        [0.        , 0.06942406],
        [0.        , 0.03583754],
        [0.        , 0.29815302],
        [0.        , 0.29815302]]), 
#        ^           ^
#        |           |
#     closest     second-closest

# this is indexes
 array([[0, 2],
        [1, 2],
        [2, 0],
        [3, 4],
        [4, 3]]))

In other words, the first column is each point matched with itself at distance 0. We therefore ignore it and use index=1 to retrieve the second-closest point (i.e. the closest point other than itself).
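Note that ClosestDist as computed above is a Euclidean distance in degrees, not a ground distance. At around 47°N a degree of longitude is noticeably shorter than a degree of latitude. If kilometres are needed, one possible approach is to map each point onto the unit sphere, query the tree in 3D, and convert the chord length back to a great-circle arc. The sketch below assumes a spherical Earth with a mean radius of 6371.0088 km. The helper name `nearest_neighbor_km` and the column name `ClosestDist_km` are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

EARTH_RADIUS_KM = 6371.0088  # assumption: mean Earth radius, spherical model


def nearest_neighbor_km(df, lat_col="Lat", lon_col="Long", id_col="OBJECT_ID"):
    """Hypothetical helper: nearest other row by great-circle distance (km)."""
    lat = np.radians(df[lat_col].to_numpy())
    lon = np.radians(df[lon_col].to_numpy())
    # Points on the unit sphere; chord length grows monotonically with arc
    # length, so the nearest neighbour by chord is also the nearest by arc.
    xyz = np.column_stack([np.cos(lat) * np.cos(lon),
                           np.cos(lat) * np.sin(lon),
                           np.sin(lat)])
    chord, idx = cKDTree(xyz).query(xyz, k=2)
    # Column 0 is the point itself (assuming no duplicate coordinates).
    half_chord = np.clip(chord[:, 1] / 2.0, 0.0, 1.0)
    arc_km = 2.0 * EARTH_RADIUS_KM * np.arcsin(half_chord)
    return df.assign(ClosestID=df[id_col].to_numpy()[idx[:, 1]],
                     ClosestDist_km=arc_km)


# Usage with the five sample rows from the question.
df = pd.DataFrame({
    "OBJECT_ID": [33007002190000.0, 33007007900000.0, 33007008830000.0,
                  33007012130000.0, 33007013320000.0],
    "Lat": [47.326963, 47.259770, 47.296953, 47.256700, 46.996013],
    "Long": [-103.079835, -103.040797, -103.099424, -103.597082, -103.452384],
})
result = nearest_neighbor_km(df)
print(result[["OBJECT_ID", "ClosestID", "ClosestDist_km"]])
```

For these five rows the nearest neighbours come out the same as in the degree-based version. On data spread over a wider area the two orderings can differ.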

Tin Lai