I have two data frames of different sizes, A (236 x 4) and B (967 x 4). I want to iterate through the rows of both data frames with a nested loop that compares the two FIPS columns and, whenever they're equal, spits out a haversine distance.

The header of one data frame looks like this:

       AddressClinic                             CountyClinicFIPS        Lat        Lon
    0  9137 ST.RT. 136, WEST UNION, OH, 45693                 1.0  -83.60553  38.809795
    1  940 N. Cable Rd., Lima, OH, 45805                      3.0  -84.14703  40.755620
    2  934 Center St. Ste E, Ashland, OH, 44805               5.0  -82.31091  40.859657
    3  934 Center ST Ste. E, Ashland, OH, 44805               5.0  -82.31091  40.859657
    4  3225 Lake Avenue, Ashtabula, OH, 44004                 7.0  -80.79042  41.876133

The other data frame has the exact same format, just with different address information:

                                       InputAddr  CountyRetailerFIPS        Lat        Lon
    0     16782 ST RT 125, WEST UNION, OH, 45693                    1  -83.41653  38.771553
    1      156 NORTH MAIN ST, PEEBLES, OH, 45660                    1  -83.40529  38.949960
    2       18811 SR 136 , WINCHESTER, OH, 45697                    1  -83.65418  38.937350
    3  2100 HARDING HIGHWAY #12, LIMA, OH, 45804                    3  -84.06680  40.730656
    4         1102 ELIDA AVE, DELPHOS, OH, 45833                    3  -84.32622  40.842110

The code I've been attempting is an amalgamation that looks like this:

import pandas as pd
import numpy as np
from numpy import cos, sin, arcsin, sqrt
from math import radians

dataclinic=pd.read_csv(r"C:\Users\Jack\Documents\Schoolwork\WICResearch\MyDocs\getdistanceclinic.csv")
dataretailer=pd.read_csv(r"C:\Users\Jack\Documents\Schoolwork\WICResearch\MyDocs\getdistanceretailer.csv")
datadistance = pd.DataFrame([])

def haversine(row):
    lon1 = row1['Lon']
    lat1 = row1['Lat']
    lon2 = row2['Lon']
    lat2 = row2['Lat']
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * arcsin(sqrt(a)) 
    km = 6367 * c
    return km

for index1, row1 in dataclinic.iterrows():
    for index2, row2 in dataretailer.iterrows():
        if row1['CountyClinicFIPS'] == row2['CountyRetailerFIPS']:
            datadistance = datadistance.apply(lambda row: haversine(row), axis=1)
        else: 
            continue


print(datadistance)

And, unsurprisingly, it doesn't work. It does something, but not much: it returns datadistance as an empty Series of shape (0,), index1 and index2 as the ints 235 and 966, and row1 and row2 as Series of 4 values each, which are the elements of the first row of the data frames.

The expected outcome is for datadistance to be filled with every case where CountyClinicFIPS equals CountyRetailerFIPS, with the distance between the two points computed and listed, to make averaging between counties easier.

Perhaps I am going about this the wrong way. Any advice is greatly appreciated. Thanks.

JDubs

1 Answer


I think you should join the two dataframes first. This gives you a dataframe containing only the rows where CountyClinicFIPS equals CountyRetailerFIPS. The code would look something like:

df = pd.merge(dataclinic, dataretailer, left_on='CountyClinicFIPS', right_on='CountyRetailerFIPS', suffixes=['_clinic','_retailer'])

Then use a vectorized implementation of haversine like the one found in this answer - Fast Haversine Approximation (Python/Pandas). This will be faster than iterating through the dataframe row by row and using an apply function.
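For reference, the vectorized version from that answer is essentially the question's formula rewritten with NumPy ufuncs so that it accepts whole columns at once (a sketch, using the same 6367 km Earth radius as the question's code):

```python
import numpy as np

def haversine_np(lon1, lat1, lon2, lat2):
    """Haversine distance in km; accepts scalars or whole pandas/NumPy columns."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 6367 * 2 * np.arcsin(np.sqrt(a))
```

Because every operation here is a NumPy ufunc, passing in four Series computes all row distances in one shot instead of one Python-level call per row.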

The haversine code would look something like this (once you've imported the haversine_np function from the link above):

df['distance'] = haversine_np(df['lon_clinic'],df['lat_clinic'],df['lon_retailer'],df['lat_retailer'])
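Putting the pieces together on toy data (the coordinates and FIPS values below are made up; note the suffixed columns keep the capitalized Lat/Lon spelling from the question's headers):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the two CSVs; coordinates are invented for illustration
dataclinic = pd.DataFrame({
    'CountyClinicFIPS': [1, 3],
    'Lat': [38.81, 40.76],
    'Lon': [-83.61, -84.15],
})
dataretailer = pd.DataFrame({
    'CountyRetailerFIPS': [1, 1, 3],
    'Lat': [38.77, 38.95, 40.73],
    'Lon': [-83.42, -83.41, -84.07],
})

# Every clinic paired with every retailer in the same county
df = pd.merge(dataclinic, dataretailer,
              left_on='CountyClinicFIPS', right_on='CountyRetailerFIPS',
              suffixes=['_clinic', '_retailer'])

# Vectorized haversine, same formula and 6367 km radius as the question
lon1, lat1 = np.radians(df['Lon_clinic']), np.radians(df['Lat_clinic'])
lon2, lat2 = np.radians(df['Lon_retailer']), np.radians(df['Lat_retailer'])
a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
df['distance'] = 6367 * 2 * np.arcsin(np.sqrt(a))

# One mean distance per county, per the stated end goal
print(df.groupby('CountyClinicFIPS')['distance'].mean())
```

Here the clinic with FIPS 1 pairs with two retailers and the clinic with FIPS 3 with one, so the merge yields three rows and the groupby collapses them to one average per county.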

Hope this helps!

piedpiper
  • Thanks, but I don't think I did a good job of clarifying what I meant. I want the values in the FIPS columns to be equal, not the rows themselves. Maybe I am missing something, but I don't see how vectorization would help with that either, as I need the distances from one point to all other points with the same FIPS identifier, if that makes sense. – JDubs Nov 14 '19 at 20:32
  • Hmm, that is how I interpreted your question originally. E.g., let's look at your first row, with CountyClinicFIPS = 1. Say there are 50 rows in dataretailer with CountyRetailerFIPS = 1. Then you want the distance from that county clinic to all 50 retailers, right? – piedpiper Nov 14 '19 at 22:08
  • Precisely. Which is why I thought a nested loop would be the way to go. But maybe that is too many observations to do with that method. – JDubs Nov 14 '19 at 23:27
  • Right, so a merge would accomplish exactly what you want to do and it'd be far quicker than nested loops. Here's the docs for the merge function: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html Unless I'm misunderstanding something? – piedpiper Nov 15 '19 at 22:49
  • Can you explain further how merge would help me do that? Would it create all the combinations with geographic points to run the function through? What I need is to have all combinations of Clinics and Retailers in a County with the same FIPS designation so I have a distance from every Clinic to every Retailer in the county. Sorry if this is confusing, I've struggled explaining my problem to everyone so far. – JDubs Nov 15 '19 at 23:09
  • Or another way to think of it, I guess, is that I need the cross product of all the retailers and clinics, but only the pairs with a matching FIPS identifier, if that helps at all. – JDubs Nov 15 '19 at 23:20
  • Have you read the merge docs I linked? The concept of pd.merge is basically an SQL join - where you want to join 2 tables together on the basis of some set of common columns - in your case, joining the clinic table with the retailer table on FIPS id so that for each clinic, you get every retailer with a matching FIPS. – piedpiper Nov 15 '19 at 23:36
  • I did read the documents but I'm not that great at programming and I didn't take away that would happen from that documentation. I think it worked. Time to run through it. Thank you so much for your patience in helping me with this. – JDubs Nov 16 '19 at 00:46
  • Do you know why I might be getting a key error for 'lon_clinic'? I can't seem to figure that part out. – JDubs Nov 19 '19 at 00:23
  • try `df['Lon_clinic']` instead? the `suffixes=['_clinic','_retailer']` argument in `pd.merge` appends "_clinic" to columns of the first dataframe where the column name is the same as the second dataframe. Similarly, "_retailer" is applied to columns of the second dataframe where the column name is the same as in the first dataframe. – piedpiper Nov 19 '19 at 00:27
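To see exactly which names a merge produced (and avoid guessing at suffix casing), it can help to print the merged columns; a minimal sketch with stand-in single-row data:

```python
import pandas as pd

# Stand-in frames mirroring the question's column names
left = pd.DataFrame({'CountyClinicFIPS': [1], 'Lat': [38.8], 'Lon': [-83.6]})
right = pd.DataFrame({'CountyRetailerFIPS': [1], 'Lat': [38.9], 'Lon': [-83.4]})

df = pd.merge(left, right, left_on='CountyClinicFIPS',
              right_on='CountyRetailerFIPS', suffixes=['_clinic', '_retailer'])

# Only the overlapping names (Lat, Lon) are suffixed, and their original
# capitalization is kept -- hence 'Lat_clinic', not 'lat_clinic'
print(list(df.columns))
```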