Distance to Closest Point in Same DF

Question

I have a df with Object ID, Latitude and Longitude. I would like to create two new columns: distance to closest point, and Object ID of closest point.

df[['OBJECT_ID','Lat','Long']].head()

          OBJECT_ID        Lat         Long
0  33007002190000.0  47.326963  -103.079835
1  33007007900000.0  47.259770  -103.040797
2  33007008830000.0  47.296953  -103.099424
3  33007012130000.0  47.256700  -103.597082
4  33007013320000.0  46.996013  -103.452384

How can this be done in Python with any library? Also if it helps, my DF contains a few thousand rows.

Assuming the points you are evaluating over are the same points in your dataset, you'll have to enumerate all distances for each object in your list of object_ids. Then take the smallest value and store it in something like a map or list. Then you can make a column out of that. Hope this helps — k88, Feb 03 '20 at 07:13
See if this post helps your answer - [Getting distance between two points using lat long](https://stackoverflow.com/questions/19412462/getting-distance-between-two-points-based-on-latitude-longitude/) — Pirate X, Feb 03 '20 at 07:27
Something along these lines but repeated for all columns... pd.DataFrame(distance.cdist(df_dist, df_dist, 'euclidean'))[0].min() — Odisseo, Feb 03 '20 at 07:28

score 1 · Accepted Answer · answered Feb 03 '20 at 08:17

You can use SciPy's cKDTree for this. It is well suited to nearest-neighbour queries on spatial data.

With your example data, you can do something like:

from scipy.spatial import cKDTree

coordinates = df[["Lat", "Long"]]
# build the k-d tree from the coordinates
kdtree = cKDTree(coordinates)
# query the same tree with the same coordinates. NOTICE the k=2
distances, indexes = kdtree.query(coordinates, k=2)

# assign the results to a new dataframe (NOTICE the column index of 1);
# .iloc selects by position, so this also works when df has a non-default index
new_df = df.assign(ClosestID=df["OBJECT_ID"].iloc[indexes[:, 1]].to_numpy())
new_df = new_df.assign(ClosestDist=distances[:, 1])

with the result of

>> new_df

          OBJECT_ID        Lat         Long         ClosestID  ClosestDist
0  33007002190000.0  47.326963  -103.079835  33007008830000.0     0.035838
1  33007007900000.0  47.259770  -103.040797  33007008830000.0     0.069424
2  33007008830000.0  47.296953  -103.099424  33007002190000.0     0.035838
3  33007012130000.0  47.256700  -103.597082  33007013320000.0     0.298153
4  33007013320000.0  46.996013  -103.452384  33007012130000.0     0.298153
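
Note that ClosestDist here is a Euclidean distance on the raw latitude/longitude values, so its unit is degrees, not a ground distance. A degree of longitude also covers less ground than a degree of latitude at these latitudes, so the ranking of neighbours can differ from the true one. If you need kilometres, one option is to project the points onto a unit sphere, build the tree on those 3-D coordinates, and convert the chord length back to a great-circle distance. The following is only a minimal sketch: it assumes a spherical Earth with a mean radius of 6371 km, and `nearest_neighbor_km` and `EARTH_RADIUS_KM` are hypothetical names I made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

EARTH_RADIUS_KM = 6371.0  # assumption: spherical Earth, mean radius


def nearest_neighbor_km(df, id_col="OBJECT_ID", lat_col="Lat", lon_col="Long"):
    """Hypothetical helper: nearest *other* row by great-circle distance."""
    lat = np.radians(df[lat_col].to_numpy())
    lon = np.radians(df[lon_col].to_numpy())

    # Project each point onto the unit sphere (x, y, z).
    xyz = np.column_stack((
        np.cos(lat) * np.cos(lon),
        np.cos(lat) * np.sin(lon),
        np.sin(lat),
    ))

    # k=2: the first hit is the point itself, the second its nearest neighbour
    # (assumes no two rows share exactly the same coordinates).
    chord, idx = cKDTree(xyz).query(xyz, k=2)

    # A chord of length c on the unit sphere spans a central angle 2*arcsin(c/2).
    angle = 2.0 * np.arcsin(np.clip(chord[:, 1] / 2.0, 0.0, 1.0))

    return df.assign(
        ClosestID=df[id_col].to_numpy()[idx[:, 1]],
        ClosestDist_km=EARTH_RADIUS_KM * angle,
    )


# The five sample rows from the question.
df = pd.DataFrame({
    "OBJECT_ID": [33007002190000.0, 33007007900000.0, 33007008830000.0,
                  33007012130000.0, 33007013320000.0],
    "Lat": [47.326963, 47.259770, 47.296953, 47.256700, 46.996013],
    "Long": [-103.079835, -103.040797, -103.099424, -103.597082, -103.452384],
})
result = nearest_neighbor_km(df)
print(result)
```

Because chord length grows monotonically with the central angle, the nearest neighbour in the 3-D tree is also the nearest neighbour along the surface of the sphere.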

The reason for using k=2 is that the closest match (when querying the tree with its own coordinates) is always the point itself, at distance 0, i.e.:

>> kdtree.query(coordinates, k=2)

# this is distance
(array([[0.        , 0.03583754],
        [0.        , 0.06942406],
        [0.        , 0.03583754],
        [0.        , 0.29815302],
        [0.        , 0.29815302]]), 
#        ^           ^
#        |           |
#     closest     second-closest

# this is indexes
 array([[0, 2],
        [1, 2],
        [2, 0],
        [3, 4],
        [4, 3]]))

The closest point to each point is itself. Therefore, we ignore the first column and use index 1 to retrieve the second-closest point (i.e. the closest point other than itself).
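
One caveat: if two rows have exactly the same coordinates, both sit at distance 0, and as far as I know the order in which the tree returns them is not guaranteed, so column 1 may point back at the row itself. A minimal sketch of a more defensive selection, using made-up sample data (the IDs 101 to 103 and the variable names are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

# Made-up sample: rows 0 and 1 share the same coordinates.
df = pd.DataFrame({
    "OBJECT_ID": [101, 102, 103],
    "Lat": [47.0, 47.0, 47.5],
    "Long": [-103.0, -103.0, -103.2],
})

coords = df[["Lat", "Long"]].to_numpy()
dist, idx = cKDTree(coords).query(coords, k=2)

rows = np.arange(len(df))
# True where the first hit really is the row itself.
self_first = idx[:, 0] == rows

# If the first hit is the row itself, take the second hit;
# otherwise the first hit is already a different row.
nearest = np.where(self_first, idx[:, 1], idx[:, 0])
nearest_dist = np.where(self_first, dist[:, 1], dist[:, 0])

result = df.assign(
    ClosestID=df["OBJECT_ID"].to_numpy()[nearest],
    ClosestDist=nearest_dist,
)
print(result)
```

With unique coordinates this reduces to the plain index-1 lookup, so it costs little to use it by default.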
