Improve efficiency of haversine function being applied to all rows in two datasets (Pandas)

Question

I am using pandas to calculate the distance between schools using lat and lon in two datasets. I am using the following haversine function:

def haversine(lon1, lat1, lon2, lat2):
"""
Calculate the great circle distance between two points 
on the earth (specified in decimal degrees)
"""
# convert decimal degrees to radians 
lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

# haversine formula 
dlon = lon2 - lon1 
dlat = lat2 - lat1 
a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
c = 2 * asin(sqrt(a)) 
r = 6371 # Radius of earth in kilometers. Use 3956 for miles
return c * r

The dataset looks like:

gr_offer, sch_code, lat, lon, distance_to_sec_school_km,nearest_sec_school_code
0   G.1-4   S0306020712 11.09990    37.987301   NaN NaN
1   G.1-4   S0401140392 7.15509 38.586300   NaN NaN
2   G.1-4   S0406150452 9.40269 41.964401   NaN NaN

The second dataset looks similar but offers different grade levels. I am now calculating the distance of the school in dataset1 to dataset2, populating the the rows in the columns from the first dataset: - 1) distance_to_sec_school_km and 2) nearest_sec_school_code. I then read the output of the first dataset as my result. My code seems to work but it just take more than one hour to run. I want to improve the efficiency.

Any advise here would be very much appreciated!

See below the function:

def calculate_distance(df1,df2):
import time
startTime = time.time()

i=0
count = 0
# two variables to store the results from the distance function. 
nearest_gps = 0
nearest_code = 0

# loop to calculate distance between primary and secondary
while i < len(df1):
    if count < len(df2):
        distance = haversine(df1['lon'].iloc[i], df1['lat'].iloc[i],
                             df2['lon'].iloc[count], df2['lat'].iloc[count])
        if nearest_gps == 0 and count < len(df2): # i.e. first iteration. 
            nearest_gps = distance
            nearest_code = df2['sch_code'].iloc[count]
            count+=1
        elif distance < nearest_gps and count < len(df2): #shortest distance replaced
            nearest_gps = distance
            nearest_code = df2['sch_code'].iloc[count]
            count+=1
        else:        
            count+=1
    else:
        df1['distance_to_sec_school_km'].iloc[i] = nearest_gps
        df1['nearest_sec_school_code'].iloc[i] = nearest_code
        i += 1
        count = 0
        nearest_gps = 0
        nearest_code = 0

executionTime = (time.time() - startTime)
print('Execution time in seconds: ' + str(executionTime))
df1.to_csv('edited_csv.csv', index=False, encoding = 'utf-8')

Try vectorizing your operations(sum, sub, abs,..), Loops are really inefficient. See [this](https://stackoverflow.com/questions/27575854/vectorizing-a-function-in-pandas) — santiagoNublado, Aug 12 '20 at 14:29
Thanks, not worked with vectors yet but appreciate the link! — dmin, Aug 27 '20 at 06:49

Improve efficiency of haversine function being applied to all rows in two datasets (Pandas)

0 Answers0