Improving the efficiency of iterating through data frames

Question

I am currently trying to iterate through an .csv file of lat/long points, calculate the distances between a pair, then check if one of the points exist in another .csv file. Currently I am putting each .csv file into a pandas data frame. What I have below works but takes too long given the number of items (~19k) in the files. I am unsure if the problem lies in the way I am iterating through or in the way I write to the output file as its my first time using pandas/large data sets as this.

for index1,row1 in iDF.iterrows():
    site1 = getattr(row1, 'site1')
    neighbors = nDF[nDF.column1 == rach1].to_list()
    for index2, row2 in iDF.loc[index1+1:maxRow-1].iterrows():
        site2 = getattr(row2, 'site1')
        dist = geopy.great_circle((getattr(row1, 'lat'), getattr(row1, 'long')),
                                 (getattr(row2, 'lat'), getattr(row2, 'long'))).miles
        if dist < 3:
            if item2 in neighbors:
                neighbor = "Y"
            else:
                neighbor = ""
            oDF = oDF.append({'site1': item1, 'site2': item2, 'distance': dist, 'neighbor': neighbor}, ignore_index=True)
oDF.to_excel(oFileName, 'Sheet1', index=False)

example input data frame

     site1     state   lat       long     misc1   misc2
    san jose     CA   32.3843  -99.25942    0       1
    chicago      IL   25.6449  -98.2424     0       1
    boston       MA   53.344   -92.3434     0       1
  san francsico  CA   32.4932  -97.3450     0       1

example neighbor data frame

 site1         site2
san jose    san francisco

expected output

  site1        site2       distance   neighbor
 san jose   san francisco     50         Y
 san jose     chicago        1000        N
 san jose     boston         1300        N
 chicago      boston         300         N
 chicago     san francisco   1050        N
 boston      san francisco   1350        N

1) What distance function are you using? 2) Your example DataFrames do not seem to correspond. Are you looking for every combination of the cities in your input DataFrame or just cities specified in your output DataFrame. — Jakub, Aug 05 '20 at 04:11
apologies, just updated the data frames to better reflect the expected output. I've modified now to use the great_circle function from geopy — kl9537, Aug 05 '20 at 04:32
Welcome! Apparently you already have a list of neighbors and want to get the distance between them. Don't loop over iDF, instead make a list of pairs from nDF, include the columns with gps locations of each side of the pair, and vectorize the distance function as in this [answer](https://stackoverflow.com/a/38973416/6692898) — RichieV, Aug 05 '20 at 04:48

score 0 · Answer 1 · answered Aug 05 '20 at 05:19

iterrows() is not best paradigm for this requirement.

build a dataframe which is valid combinations of cities
join (merge()) long and lat onto these combinations
calculate distance and whether it's a neighbour

import itertools
import geopy
data = """     site1     state   lat       long     misc1   misc2
    san jose     CA   32.3843  -99.25942    0       1
    chicago      IL   25.6449  -98.2424     0       1
    boston       MA   53.344   -92.3434     0       1
  san francsico  CA   32.4932  -97.3450     0       1"""
a = [[t.strip() for t in re.split("  ",l) if t!=""]  for l in [re.sub("([0-9]?[ ])*(.*)", r"\2", l) for l in data.split("\n")]]

dfs = pd.DataFrame(a[1:], columns=a[0])
# construct a new df that is all combinations of "site1"
dfc = pd.DataFrame([list(c) for c in itertools.combinations(dfs["site1"].tolist(), 2)], 
                   columns=["site1","site2"])
# merge combinations with original data,  so have long/lat of site combinations on one row
dfc = dfc.merge(dfs.loc[:,("site1","long","lat")], left_on="site1", right_on="site1", how="inner")\
    .merge(dfs.loc[:,("site1","long","lat")], left_on="site2", right_on="site1", how="inner")\
    .drop("site1_y", axis=1).rename(columns={"site1_x":"site1"})
# finally calculate distance and whether it's a neighbor
dfc = dfc.assign(distance=lambda dfa: dfa.apply(lambda r: 
            geopy.distance.geodesic((r["lat_x"],r["long_x"]), (r["lat_y"],r["long_y"])).miles, axis=1),
          neighbor=lambda dfa: dfa.distance<500)

print(dfc.to_string(index=False))

output

    site1          site2     long_x    lat_x    long_y    lat_y     distance  neighbor
 san jose        chicago  -99.25942  32.3843  -98.2424  25.6449   468.203941      True
 san jose         boston  -99.25942  32.3843  -92.3434   53.344  1487.138344     False
  chicago         boston   -98.2424  25.6449  -92.3434   53.344  1935.255672     False
 san jose  san francsico  -99.25942  32.3843  -97.3450  32.4932   112.117077      True
  chicago  san francsico   -98.2424  25.6449  -97.3450  32.4932   474.759324      True
   boston  san francsico   -92.3434   53.344  -97.3450  32.4932  1460.626982     False

Jakub · Accepted Answer · 2020-08-05T07:33:36.530

No Loops Solution:

import pandas as pd
from itertools import combinations
import geopy.distance as gd

def geo_dist(lat_x, long_x, lat_y, long_y):
    return round(gd.great_circle((lat_x, long_x), (lat_y, long_y)).miles,1)

# Input DataFrame
data = {'site1': ['san jose','chicago','boston','san francisco'],
        'state': ['CA','IL','MA','CA'],
        'lat':   [32.3843,25.6449,53.344,32.4932],
        'long':  [-99.25942,-98.2424,-92.3434,-97.345],
        'misc1': [0,0,0,0],
        'misc2': [1,1,1,1]}
df = pd.DataFrame(data)

# Neigbors DataFrame
dfn = pd.DataFrame({'site1':['san jose'], 'site2':['san francisco']})
dfn['neighbor'] = 'Y'

# Get Combinations
df1 = pd.DataFrame(list(combinations(df['site1'],2)), columns=['site1','site2'])

# Merge site1
df1 = pd.merge(df1, df[['site1','lat', 'long']], how='left', on='site1')

# Merge site2
df = df.rename(columns={'site1': 'site2'})
df1 = pd.merge(df1, df[['site2','lat', 'long']], how='left', on='site2')

# Calculate distance
df1['distance'] = df1.apply(lambda r : geo_dist(r['lat_x'], r['long_x'], r['lat_y'], r['long_y'] ), axis = 1)

# Drop unused columns
df1 = df1.drop(columns=['lat_x', 'long_x', 'lat_y', 'long_y'])

# Merge with neighbor DataFrame
df1 = pd.merge(df1, dfn, how='left', on=['site1','site2']).fillna('N')

print(df1)

Output:

      site1          site2  distance neighbor
0  san jose        chicago     469.7        N
1  san jose         boston    1488.2        N
2  san jose  san francisco     111.9        Y
3   chicago         boston    1938.0        N
4   chicago  san francisco     476.3        N
5    boston  san francisco    1461.8        N

thank you for the comments. I was able to modify this nicely for my purpose. — kl9537, Aug 05 '20 at 17:49

Improving the efficiency of iterating through data frames

2 Answers2