
I have over 1 million rows of latitude/longitude positions. My goal is to check each of these rows against a data set of about 43,000 zip codes, each of which has a central latitude/longitude.

I want to calculate the haversine distance between each row and the large zip code list, then take the closest lat/long and return it, or the corresponding zip code, to the leftmost frame (in essence, giving the closest zip code to each latitude/longitude in the large frame).

I have tried several things, including vectorized haversine functions and looping through each row, calculating and moving to the next, but I can't quite get them to work. Given the size of my data I know that simply looping through each row and calculating won't work; I need a new solution, and I think it involves vectorization.

Here are some sample frames of my data. df is the large frame; for each of its rows I am trying to find the smallest distance to the zip_list entries and return the corresponding zip code back to df.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.array([[42.801104, -76.827879], [38.187102, -83.433917],
                                [35.973115, -83.955932]]), columns=['Lat', 'Long'])

    zip_list = pd.DataFrame(np.array([[49544, 42.999561, -85.75371], [49648, 45.000254, -85.3651],
                                      [49654, 45.023384, -85.75697], [50265, 41.570916, -93.73568]]),
                            columns=['ZipCode', 'Latitude', 'Longitude'])

I would like to return the minimum-distance zip code to the corresponding row in the df frame.

Any ideas would be great. I am a beginner with vectorization and numpy/pandas.
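
For context, a vectorized haversine helper of the standard form is shown below; this is only a sketch, and the name haversine_np and the use of kilometres are illustrative rather than code taken from the question.

    import numpy as np

    def haversine_np(lat1, lon1, lat2, lon2):
        # Great-circle distance in kilometres between points given in decimal degrees.
        # All arguments may be NumPy arrays and broadcast against each other.
        lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
        dlat = lat2 - lat1
        dlon = lon2 - lon1
        a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
        return 6371.0 * 2 * np.arcsin(np.sqrt(a))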

Coldchain9
  • Possible duplicate of [Fast Haversine Approximation (Python/Pandas)](https://stackoverflow.com/questions/29545704/fast-haversine-approximation-python-pandas) – Joe May 17 '19 at 16:57
  • Possible duplicate of [Vectorizing Haversine distance calculation in Python](https://stackoverflow.com/questions/34502254/vectorizing-haversine-distance-calculation-in-python) – m13op22 May 17 '19 at 16:57
  • @Joe I've seen these and I still can't quite figure out how to compare one row on my left frame to another frame of 40000 observations and return the minimum result set as a new entry on the left. – Coldchain9 May 17 '19 at 17:05
  • Then you might need a computer with a lot of RAM and some numpy magic. Maybe broadcasting might work. You add a new dimension to the array and apply the solution of the other thread to all "slices" at once. Then use numpy.min along an axis to reduce the array. I won't write it for you, but it is possible :) – Joe May 17 '19 at 17:32
  • Basically dimension one and two hold your df and the third is the zip-codes repeated. – Joe May 17 '19 at 17:34
  • You need https://docs.scipy.org/doc/numpy/reference/generated/numpy.repeat.html and https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html and the vectorized Haversine and https://docs.scipy.org/doc/numpy/reference/generated/numpy.amin.html#numpy.amin with the axis keyword – Joe May 17 '19 at 17:38
  • @Joe This is great but how exactly do I add this extra dimension to hold my df? I'm not very well versed in any of this operation so some more tips would be useful.. – Coldchain9 May 17 '19 at 18:12
  • at best use https://docs.scipy.org/doc/numpy/reference/generated/numpy.expand_dims.html or the shortcuts like x[:,np.newaxis] mentioned there. Then repeat along that axis. Start with a simple example. Create a 3d array with zeros, np.zeros((5, 5, 5)), then create a 2d array with random numbers np.random.rand(5,5) and fill that into the 3d array with a for-loop. Slow, but easier to see what's going on. – Joe May 17 '19 at 19:04
  • Use something like my3d[:,:,0] = my2d, then use numpy functions to do the same, np.repeat, np.tile, When you are trying to do something like you are planning, try using a simple example first. It will be the same for millions and thousands of rows :) – Joe May 17 '19 at 19:07
  • @Joe Trying.. still pretty lost here.... – Coldchain9 May 17 '19 at 20:11
  • Took me years :) https://docs.scipy.org/doc/numpy/user/quickstart.html and http://scipy-lectures.org/intro/numpy/operations.html – Joe May 17 '19 at 20:47
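
Following the broadcasting approach Joe outlines in the comments above (add an axis, apply the vectorized haversine to all pairs at once, then reduce along the zip axis), a minimal sketch is shown below. It reuses the haversine_np helper from above, uses argmin rather than amin so the index of the nearest zip code is recovered instead of just the distance, and processes df in chunks because a full 1,000,000 × 43,000 distance matrix would not fit in memory. The function name nearest_zip and the chunk size are illustrative assumptions, not code from the thread.

    def nearest_zip(df, zip_list, chunk=10_000):
        # For each row of df, return the ZipCode of the nearest zip_list entry.
        lat = df['Lat'].to_numpy()
        lon = df['Long'].to_numpy()
        zlat = zip_list['Latitude'].to_numpy()
        zlon = zip_list['Longitude'].to_numpy()
        idx = np.empty(len(df), dtype=np.int64)
        for start in range(0, len(df), chunk):
            stop = start + chunk
            # (chunk, 1) broadcast against (n_zips,) gives a (chunk, n_zips) distance matrix
            d = haversine_np(lat[start:stop, None], lon[start:stop, None], zlat, zlon)
            idx[start:stop] = d.argmin(axis=1)
        return zip_list['ZipCode'].to_numpy()[idx]

    df['ClosestZip'] = nearest_zip(df, zip_list)

Chunking keeps the peak memory of each intermediate array at roughly chunk × 43,000 doubles, so the chunk size can be tuned to the available RAM.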

0 Answers