I am using pandas to calculate the distance between schools using lat and lon in two datasets. I am using the following haversine function:
def haversine(lon1, lat1, lon2, lat2):
"""
Calculate the great circle distance between two points
on the earth (specified in decimal degrees)
"""
# convert decimal degrees to radians
lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
# haversine formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
c = 2 * asin(sqrt(a))
r = 6371 # Radius of earth in kilometers. Use 3956 for miles
return c * r
The dataset looks like:
gr_offer, sch_code, lat, lon, distance_to_sec_school_km,nearest_sec_school_code
0 G.1-4 S0306020712 11.09990 37.987301 NaN NaN
1 G.1-4 S0401140392 7.15509 38.586300 NaN NaN
2 G.1-4 S0406150452 9.40269 41.964401 NaN NaN
The second dataset looks similar but offers different grade levels. I am now calculating the distance of the school in dataset1 to dataset2, populating the the rows in the columns from the first dataset: - 1) distance_to_sec_school_km and 2) nearest_sec_school_code. I then read the output of the first dataset as my result. My code seems to work but it just take more than one hour to run. I want to improve the efficiency.
Any advise here would be very much appreciated!
See below the function:
def calculate_distance(df1,df2):
import time
startTime = time.time()
i=0
count = 0
# two variables to store the results from the distance function.
nearest_gps = 0
nearest_code = 0
# loop to calculate distance between primary and secondary
while i < len(df1):
if count < len(df2):
distance = haversine(df1['lon'].iloc[i], df1['lat'].iloc[i],
df2['lon'].iloc[count], df2['lat'].iloc[count])
if nearest_gps == 0 and count < len(df2): # i.e. first iteration.
nearest_gps = distance
nearest_code = df2['sch_code'].iloc[count]
count+=1
elif distance < nearest_gps and count < len(df2): #shortest distance replaced
nearest_gps = distance
nearest_code = df2['sch_code'].iloc[count]
count+=1
else:
count+=1
else:
df1['distance_to_sec_school_km'].iloc[i] = nearest_gps
df1['nearest_sec_school_code'].iloc[i] = nearest_code
i += 1
count = 0
nearest_gps = 0
nearest_code = 0
executionTime = (time.time() - startTime)
print('Execution time in seconds: ' + str(executionTime))
df1.to_csv('edited_csv.csv', index=False, encoding = 'utf-8')