Pandas merge two dataframes on nearest distance between coordinates

Question

I have two datasets that I need to join together based on the nearest distance between coordinates (latitude and longitude)

df1 contains information about air pollution measurements and looks like this:

station_air	latitude	longitude	measurement
1	53.23	4.81	60
2	51.93	5.82	140
...	...	...	...

df2 contains data collected by weather stations and looks like this:

station_air	latitude	longitude	measurement
310	52.46	4.51	12.3
290	51.48	5.82	15.6
...	...	...	...

There are other columns such as time of measurement, date, month etc etc but are not relevant for this question. All stations in both df1 and df2 are unique, so we can't merge on their ID.

Ideally, I want the following table:

station_air	latitude	longitude	measurement	temp
...	...	...	...	...
...	...	...	...	...
..	...	...	...	...

It comes down to this: for each row (station_air) in df1, find in df2 the weather station with the closest distance to that station_air, and add temp.

I googled around and found this: Haversine Formula in Python (Bearing and Distance between two GPS points)

But this takes too long since my df1 is around 1,5mil rows

Can anyone help me out?

How long is df2? and are you iterating over df2 entirely or a filtered df2? Also, this sounds like an opportunity to use multiprocessing or threading. or https://dask.org/ — Jonathan Leon, Jun 06 '21 at 03:00
df2 is around 39k rows. Ideally I want to iterate over the entire data. I'll look into multiprocessing, thank you — HenkieTaylor, Jun 06 '21 at 13:09
Further reading. Also if you want to post a bit more if your data for both data frames you may get some other advice or help. https://stackoverflow.com/questions/48887912/find-minimum-distance-between-points-of-two-lists-in-python — Jonathan Leon, Jun 06 '21 at 14:30

Pandas merge two dataframes on nearest distance between coordinates

0 Answers0