
I have two dataframes: radar, which represents data on an equispaced grid with columns for longitude, latitude and a height value, and ice, which holds information related to satellite observations, including the latitude and longitude of each observation. I want to merge the two so that ice gains the 'height' column from radar, matching each ice row to the radar point that is closest by geodetic distance.

I'm currently doing it like this:

from geopy.distance import geodesic
import pandas as pd

def get_distance(out):
    global radar
    # geodesic distance (km) from this ice row to every radar point
    dists = radar['latlon'].apply(lambda x: geodesic(out['latlon'], x).km)
    out['dist to radar'] = dists.min()
    out['rate_yr_radar'] = radar.loc[dists.idxmin(), 'rate_yr_radar']
    return out

ICEvsRadar = ice.apply(get_distance, axis=1)

But it's very slow: I have around 200 points in my ice dataframe and around 50,000 in the radar one. Is slow performance to be expected given the computational cost of calculating each distance, or could I improve how I apply the function?
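For context on the cost: the loop above makes roughly 200 × 50,000 individual geopy calls, one Python-level geodesic evaluation per pair. Before reaching for a spatial index, one cheaper route is a vectorized haversine in NumPy, which computes all distances from one ice row in a single array operation. This is a sketch, not the asker's code: the `haversine_km` helper is illustrative, and haversine assumes a spherical Earth (radius 6371 km), so it only approximates geopy's ellipsoidal geodesic.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    # Vectorized great-circle (haversine) distance in km, spherical Earth.
    # Any argument may be a scalar or a NumPy array; broadcasting applies.
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

# Usage sketch: distances from one ice point to every radar point at once,
# then take the argmin instead of applying geodesic() row by row:
# d = haversine_km(ice_lat, ice_lon, radar['lat'].to_numpy(), radar['lon'].to_numpy())
# nearest_idx = d.argmin()
```

A quarter great circle (equator to 90°E at the equator) comes out near 10,007.5 km, which is a quick sanity check for the formula.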

edit: uploaded the example data on https://wetransfer.com/downloads/284036652e682a3e665994d360a3068920221203230651/5842f2

The code takes around 25 minutes to run; ice has lon, lat and latlon fields and is 180 rows long, and radar has 50,000 rows with lon, lat, latlon and rate_yr_radar fields.

Edit: Using the help from the answer by Atanas, I ended up solving it like this:

import pandas as pd
import numpy as np
from sklearn.neighbors import BallTree

# build the tree on radar coordinates (haversine metric expects radians)
tree = BallTree(np.deg2rad(radar[['lat', 'lon']].values), metric='haversine')

# query the nearest neighbour for every ice point
distance, index = tree.query(np.deg2rad(ice.loc[:, ['lat', 'lon']]))

# get the relevant data from radar to merge with ice
reduced_radar = radar.loc[np.concatenate(index), ['rate_yr_radar']]
reduced_radar['dist to radar'] = np.concatenate(distance) * 6371  # radians -> km
reduced_radar = reduced_radar.reset_index().rename({'index': 'index_from_radar'}, axis=1)

# join the data
ice = ice.join(reduced_radar)

It went from a 30 minute runtime to 60 milliseconds!
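To make the radians-in/radians-out convention of that solution concrete, here is a tiny self-contained sketch of the same BallTree pattern on made-up toy data (the dataframe contents below are illustrative, not the asker's data):

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

# Toy stand-ins for the real radar/ice dataframes.
radar = pd.DataFrame({"lat": [0.0, 10.0, 20.0],
                      "lon": [0.0, 10.0, 20.0],
                      "rate_yr_radar": [1.0, 2.0, 3.0]})
ice = pd.DataFrame({"lat": [9.5], "lon": [10.5]})

# BallTree with the haversine metric takes coordinates in RADIANS,
# ordered (lat, lon), and returns distances in radians.
tree = BallTree(np.deg2rad(radar[["lat", "lon"]].values), metric="haversine")
distance, index = tree.query(np.deg2rad(ice[["lat", "lon"]].values))

# Pull the matched radar rows and convert radians to km.
nearest = radar.loc[np.concatenate(index), ["rate_yr_radar"]]
nearest["dist to radar"] = np.concatenate(distance) * 6371
```

Here the ice point at (9.5°, 10.5°) should match the radar point at (10°, 10°), i.e. the row with rate_yr_radar of 2.0, at a distance on the order of 80 km.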


1 Answer


This code takes less than a second on my machine. Note it uses planar distance in raw lon/lat, so it will probably give wrong matches near the antimeridian (where longitude wraps from 180°E to 180°W) and distorted ones near the poles.

import pandas as pd
import numpy as np
from scipy.spatial import KDTree

# read the data
radar = pd.read_csv("radar.csv")
ice = pd.read_csv("ice.csv")

# extract the radar coordinates
pts = np.array(radar.loc[:, ["lon", "lat"]])

# build the k-d tree
tree = KDTree(pts)

# query the nearest neighbour for every ice point
distance, index = tree.query(ice.loc[:, ["lon", "lat"]])

# get the relevant data from radar
reduced_radar = radar.loc[index, ["rate_yr_radar"]]
reduced_radar = reduced_radar.reset_index().rename({"index": "index_from_radar"}, axis=1)

# join the data
ice = ice.join(reduced_radar)

Alternatively one could look at https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.sjoin_nearest.html

  • Thank you! It's great to see that it performs so well. However, is it using geodesic distance, or planar geometry in longitude and latitude? I now think it could be possible to combine the two: use planar geometry to get the closest N points, and calculate geodesic distance only on those to find the closest one. But I've yet to work out how to set N generically so the actual closest point is always included. – Feva Dec 04 '22 at 17:49
  • You are right, it is planar distance. Intuitively I don't think your approach will work: 170 degrees west and 170 degrees east are very far apart in planar terms but geodesically close. If you still want to try it, there is scipy.spatial.KDTree.query_ball_point – Atanas Atanasov Dec 05 '22 at 22:25
  • However, I just found out there is this answer here -> https://stackoverflow.com/questions/10549402/kdtree-for-longitude-latitude This should generally solve your problem: just use the BallTree instead of the KDTree – Atanas Atanasov Dec 05 '22 at 22:30
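For completeness, the linked answer's idea can be sketched as follows: map each lat/lon onto the unit sphere and build an ordinary planar KDTree on the resulting 3-D coordinates. Chord distance through the Earth is monotonic in great-circle distance, so the planar nearest neighbour in 3-D is also the geodesic nearest neighbour, including across the antimeridian. This is an illustrative sketch (the `to_unit_xyz` helper is not from the thread), assuming a spherical Earth of radius 6371 km:

```python
import numpy as np
from scipy.spatial import KDTree

def to_unit_xyz(lat_deg, lon_deg):
    # Map lat/lon (degrees) to Cartesian points on the unit sphere.
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    return np.column_stack((np.cos(lat) * np.cos(lon),
                            np.cos(lat) * np.sin(lon),
                            np.sin(lat)))

# Candidates at 170 degrees E and 0 degrees, both on the equator.
tree = KDTree(to_unit_xyz([0.0, 0.0], [170.0, 0.0]))

# Query point at 170 degrees W: far from 170 degrees E in raw lon/lat,
# but only 20 degrees away on the sphere.
chord, idx = tree.query(to_unit_xyz([0.0], [-170.0]))

# Convert straight-line chord length back to arc length in km.
great_circle_km = 2 * np.arcsin(chord / 2) * 6371
```

The query correctly picks the 170°E point (index 0), at roughly 2224 km, which is exactly the wraparound case where the plain lon/lat KDTree would fail.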