2

I have two datasets that I need to join together based on the nearest distance between coordinates (latitude and longitude)

df1 contains information about air pollution measurements and looks like this:

station_air latitude longitude measurement
1 53.23 4.81 60
2 51.93 5.82 140
... ... ... ...

df2 contains data collected by weather stations and looks like this:

station_air latitude longitude measurement
310 52.46 4.51 12.3
290 51.48 5.82 15.6
... ... ... ...

There are other columns such as time of measurement, date, month etc etc but are not relevant for this question. All stations in both df1 and df2 are unique, so we can't merge on their ID.

Ideally, I want the following table:

station_air latitude longitude measurement temp
... ... ... ... ...
... ... ... ... ...
.. ... ... ... ...

It comes down to this: for each row (station_air) in df1, find in df2 the weather station with the closest distance to that station_air, and add temp.

I googled around and found this: Haversine Formula in Python (Bearing and Distance between two GPS points)

But this takes too long since my df1 is around 1,5mil rows

Can anyone help me out?

  • How long is df2? and are you iterating over df2 entirely or a filtered df2? Also, this sounds like an opportunity to use multiprocessing or threading. or https://dask.org/ – Jonathan Leon Jun 06 '21 at 03:00
  • df2 is around 39k rows. Ideally I want to iterate over the entire data. I'll look into multiprocessing, thank you – HenkieTaylor Jun 06 '21 at 13:09
  • Further reading. Also if you want to post a bit more if your data for both data frames you may get some other advice or help. https://stackoverflow.com/questions/48887912/find-minimum-distance-between-points-of-two-lists-in-python – Jonathan Leon Jun 06 '21 at 14:30

0 Answers0