I have over 1 million rows of latitude/longitude positions. My goal is to check each of these rows against a data set of about 43,000 zip codes, each of which has a central latitude/longitude.
I want to calculate the haversine distance between each row and every entry in the zip code list. I then want to take the closest lat/long and return it, or the corresponding zip code, to the large frame (in essence, giving the closest zip code for each latitude/longitude in the large frame).
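To be precise about the distance I mean, here is a minimal sketch of the haversine formula for a single pair of points. The function name `haversine_km` and the mean Earth radius of 6371 km are my own assumptions for illustration, not something fixed by the data.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: mean Earth radius in kilometres


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    # Haversine term: a = sin^2(dphi/2) + cos(phi1) * cos(phi2) * sin^2(dlmb/2)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Distance from the first sample position to the first sample zip code centre.
d = haversine_km(42.801104, -76.827879, 42.999561, -85.75371)
```

Doing this one pair at a time would mean roughly 43 billion calls for my data, which is why I am looking for something vectorized.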
I have tried several things, including vectorized haversine functions and looping through each row (calculate, then move to the next), but I can't quite get them to work. Given the size of my data, I know that simply looping through each row and calculating won't be fast enough. I need a different approach, and I think it might involve vectorization.
Here are some sample frames of my data. df is the large frame; for each of its rows I want to find the closest entry in zip_list and return the corresponding zip code to df.
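import numpy as np
import pandas as pd

# Large frame of positions (over 1 million rows in the real data)
df = pd.DataFrame(np.array([[42.801104, -76.827879], [38.187102, -83.433917],
                            [35.973115, -83.955932]]), columns=['Lat', 'Long'])
# Zip codes with their central latitude/longitude (about 43,000 rows in the real data)
zip_list = pd.DataFrame(np.array([[49544, 42.999561, -85.75371], [49648, 45.000254, -85.3651],
                                  [49654, 45.023384, -85.75697], [50265, 41.570916, -93.73568]]),
                        columns=['ZipCode', 'Latitude', 'Longitude'])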
For each row in df, I would like to return the zip code from zip_list with the minimum distance to that row.
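For illustration, the kind of vectorized approach I have in mind might look like the rough sketch below. It uses NumPy broadcasting to compute the haversine distance from a chunk of positions to every zip code at once, then takes the minimum per row. The helper name `nearest_zip`, the `chunk_size` default, the output column names `ZipCode` and `DistKm`, and the Earth radius of 6371 km are all assumptions of mine, and I have not verified this at full scale.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # assumption: mean Earth radius in kilometres


def nearest_zip(df, zip_list, chunk_size=500):
    """For each row of df, return the nearest ZipCode and its haversine distance in km."""
    zlat = np.radians(zip_list['Latitude'].to_numpy())
    zlon = np.radians(zip_list['Longitude'].to_numpy())
    zips = zip_list['ZipCode'].to_numpy()
    lat = np.radians(df['Lat'].to_numpy())
    lon = np.radians(df['Long'].to_numpy())

    nearest = np.empty(len(df), dtype=zips.dtype)
    dist_km = np.empty(len(df))
    # Work in chunks so the (chunk x n_zips) distance matrix stays small in memory.
    for start in range(0, len(df), chunk_size):
        stop = start + chunk_size
        la = lat[start:stop, None]  # column vector, broadcasts against zlat
        lo = lon[start:stop, None]
        a = (np.sin((zlat - la) / 2) ** 2
             + np.cos(la) * np.cos(zlat) * np.sin((zlon - lo) / 2) ** 2)
        d = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))
        idx = d.argmin(axis=1)  # position of the closest zip code for each row
        nearest[start:stop] = zips[idx]
        dist_km[start:stop] = d[np.arange(len(idx)), idx]
    return nearest, dist_km


df = pd.DataFrame({'Lat': [42.801104, 38.187102, 35.973115],
                   'Long': [-76.827879, -83.433917, -83.955932]})
zip_list = pd.DataFrame({'ZipCode': [49544, 49648, 49654, 50265],
                         'Latitude': [42.999561, 45.000254, 45.023384, 41.570916],
                         'Longitude': [-85.75371, -85.3651, -85.75697, -93.73568]})

nearest, dist_km = nearest_zip(df, zip_list)
df['ZipCode'] = nearest
df['DistKm'] = dist_km
```

With 43,000 zip codes, each chunk of 500 rows produces a 500 x 43,000 array of floats, so `chunk_size` trades memory for fewer loop iterations. I suspect a spatial index could avoid comparing against every zip code, but I don't know whether that is needed here.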
Any ideas would be great. I am a beginner with vectorization and NumPy/pandas.