I am currently trying to iterate through a .csv file of lat/long points, calculate the distance between each pair of points, and then check whether the pair appears in a second .csv file of known neighbors. I'm loading each .csv file into a pandas data frame. The code below works, but it takes far too long given the number of rows (~19k) in the files. I'm not sure whether the problem lies in the way I'm iterating or in the way I write to the output file; this is my first time working with pandas and data sets of this size.
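For context, this is roughly how the data frames are set up (the file names and the empty output frame below are placeholders rather than my exact code):

import pandas as pd

# site list: one row per site with its coordinates
iDF = pd.read_csv('sites.csv')        # placeholder file name
# known neighbor pairs: columns site1, site2
nDF = pd.read_csv('neighbors.csv')    # placeholder file name

# empty output frame that the loop below appends to
oDF = pd.DataFrame(columns=['site1', 'site2', 'distance', 'neighbor'])
oFileName = 'output.xlsx'             # placeholder output path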
from geopy.distance import great_circle

for index1, row1 in iDF.iterrows():
    site1 = row1['site1']
    # all sites listed as neighbors of site1 in the neighbor frame
    neighbors = nDF[nDF['site1'] == site1]['site2'].to_list()
    # only compare against the rows after this one so each pair is handled once
    for index2, row2 in iDF.loc[index1 + 1:].iterrows():
        site2 = row2['site1']
        dist = great_circle((row1['lat'], row1['long']),
                            (row2['lat'], row2['long'])).miles
        if dist < 3:
            if site2 in neighbors:
                neighbor = "Y"
            else:
                neighbor = "N"
            # grow the output frame one row at a time
            oDF = oDF.append({'site1': site1, 'site2': site2,
                              'distance': dist, 'neighbor': neighbor},
                             ignore_index=True)

oDF.to_excel(oFileName, sheet_name='Sheet1', index=False)
Example input data frame (iDF):
site1          state  lat      long       misc1  misc2
san jose       CA     32.3843  -99.25942  0      1
chicago        IL     25.6449  -98.2424   0      1
boston         MA     53.344   -92.3434   0      1
san francisco  CA     32.4932  -97.3450   0      1
Example neighbor data frame (nDF):
site1     site2
san jose  san francisco
Expected output (oDF):
site1     site2          distance  neighbor
san jose  san francisco  50        Y
san jose  chicago        1000      N
san jose  boston         1300      N
chicago   boston         300       N
chicago   san francisco  1050      N
boston    san francisco  1350      N