I have the following dataframes (simplified):
df1 = prosplat prosplong type
50 45 prosp1
34 -25 prosp2
df2 = complat complong type
29 58 competitor1
68 34 competitor2
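
For reproducibility, the two simplified frames above can be built as follows. This is only a minimal sketch: the variable names `prospcords` and `compcords` and the column spelling `prosplat` are taken from the loop code later in this post, and the real prospect frame has about 740k rows rather than two.

```python
import pandas as pd

# Simplified prospect coordinates (df1 above).
prospcords = pd.DataFrame({
    "prosplat": [50, 34],
    "prosplong": [45, -25],
    "type": ["prosp1", "prosp2"],
})

# Simplified competitor coordinates (df2 above).
compcords = pd.DataFrame({
    "complat": [29, 68],
    "complong": [58, 34],
    "type": ["competitor1", "competitor2"],
})

print(prospcords.shape, compcords.shape)
```
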
I want to run a distance calculation between each individual prospect (740k prospects in total) and every competitor, so the output would look like the following:
df3 = d_p(x)_to_c1 d_p(x)_to_c2 d_p(x)_to_c3
234.34 895.34 324.5
where every row of the output is a new prospect.
My current code is the following:
import geopy.distance
import pandas as pd

prospectsarray = []
prosparr = []
for i, row in prospcords.iterrows():
    lat1 = row['prosplat']
    lon1 = row['prosplong']
    coords = [lat1, lon1]
    distancearr2 = []  # distances from this prospect to every competitor
    for x, row2 in compcords.iterrows():
        lat2 = row2['complat']
        lon2 = row2['complong']
        coords2 = [lat2, lon2]
        distance = geopy.distance.distance(coords, coords2).miles
        # Anything farther than 300 miles is recorded as 0
        if distance > 300:
            distance = 0
        distancearr2.append(distance)
    prosparr.append(distancearr2)
prospectsarray.extend(prosparr)
dfprosp = pd.DataFrame(prospectsarray)
While this accomplishes my goal, it is horrendously slow.
I have tried the following optimization, but the output is not iterating, and I am still using iterrows, which is what I was trying to avoid.
competitorlist = []
def distancecalc(df):
    distance_list = []
    for i in range(0, len(prospcords)):
        coords2 = [prospcords.iloc[i]['prosplat'], prospcords.iloc[i]['prosplong']]
        d = geopy.distance.distance(coords1, coords2).miles
        print(d)
        if d > 300:
            d = 0
        distance_list.append(d)
    competitorlist.append(distance_list)
for x, row2 in compcords.iterrows():
    lat2 = row2['complat']
    lon2 = row2['complong']
    coords1 = [lat2, lon2]
    distancecalc(prospcords)
print(distance_list)
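
For comparison, here is a rough sketch of what a loop-free version might look like using NumPy broadcasting. It is not a drop-in replacement: it swaps geopy's geodesic distance for the haversine great-circle formula on a sphere with a mean radius of 3958.8 miles, so its results can differ slightly from geopy's. The function name `haversine_matrix` and the constant `EARTH_RADIUS_MILES` are hypothetical; the 300-mile cutoff and the column names come from the code above.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius (spherical approximation)


def haversine_matrix(lat1, lon1, lat2, lon2):
    """Great-circle distances in miles between every pair of points.

    lat1/lon1 have shape (n,), lat2/lon2 have shape (m,); the result is (n, m).
    """
    # Convert to radians and reshape so broadcasting yields an (n, m) grid.
    lat1 = np.radians(np.asarray(lat1, dtype=float))[:, None]
    lon1 = np.radians(np.asarray(lon1, dtype=float))[:, None]
    lat2 = np.radians(np.asarray(lat2, dtype=float))[None, :]
    lon2 = np.radians(np.asarray(lon2, dtype=float))[None, :]

    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    # Clip guards against tiny floating-point overshoot before the square root.
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(np.clip(a, 0, 1)))


# Sample data from the top of the question.
prospcords = pd.DataFrame({"prosplat": [50, 34], "prosplong": [45, -25]})
compcords = pd.DataFrame({"complat": [29, 68], "complong": [58, 34]})

dist = haversine_matrix(prospcords["prosplat"], prospcords["prosplong"],
                        compcords["complat"], compcords["complong"])

# Same rule as the loop version: distances over 300 miles become 0.
dist = np.where(dist > 300, 0, dist)

# One row per prospect, one column per competitor.
dfprosp = pd.DataFrame(dist, index=prospcords.index)
print(dfprosp.shape)
```

With 740k prospects the full matrix holds 740k × (number of competitors) floats at 8 bytes each, so for a large competitor list it may be necessary to process the prospects in chunks and concatenate the results.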