I have two datasets that I need to join together based on the nearest distance between coordinates (latitude and longitude)
df1
contains information about air pollution measurements and looks like this:
station_air | latitude | longitude | measurement |
---|---|---|---|
1 | 53.23 | 4.81 | 60 |
2 | 51.93 | 5.82 | 140 |
... | ... | ... | ... |
df2
contains data collected by weather stations and looks like this:
station_air | latitude | longitude | measurement |
---|---|---|---|
310 | 52.46 | 4.51 | 12.3 |
290 | 51.48 | 5.82 | 15.6 |
... | ... | ... | ... |
There are other columns such as time of measurement, date, month etc etc but are not relevant for this question. All stations in both df1
and df2
are unique, so we can't merge on their ID.
Ideally, I want the following table:
station_air | latitude | longitude | measurement | temp |
---|---|---|---|---|
... | ... | ... | ... | ... |
... | ... | ... | ... | ... |
.. | ... | ... | ... | ... |
It comes down to this: for each row (station_air
) in df1
, find in df2
the weather station with the closest distance to that station_air
, and add temp
.
I googled around and found this: Haversine Formula in Python (Bearing and Distance between two GPS points)
But this takes too long since my df1
is around 1,5mil rows
Can anyone help me out?