I have two HUGE Pandas dataframes with location based values, and I need to update df1['count'] with the number of records from df2 which are less than 1000m from each point in df1.
Here's an example of my data, imported into Pandas
df1 = lat long valA count
0 123.456 986.54 1 0
1 223.456 886.54 2 0
2 323.456 786.54 3 0
3 423.456 686.54 2 0
4 523.456 586.54 1 0
df2 = lat long valB
0 123.456 986.54 1
1 223.456 886.54 2
2 323.456 786.54 3
3 423.456 686.54 2
4 523.456 586.54 1
In reality, df1 has ~10 million rows, and df2 has ~1 million
I created a working nested FOR loop using the Pandas DF.itertuples() method, which works fine for smaller test data sets (df1=1k Rows & df2=100 Rows takes about an hour to complete), but the full data set is exponentially larger, and will take years to complete based on my calculations. Here's my working code...
import pandas as pd
import geopy.distance as gpd
file1 = 'C:\\path\\file1.csv'
file2 = 'C:\\path\\file2.csv'
df1 = pd.read_csv(file1)
df2 = pd.read_csv(file2)
df1.sort_values(['long', 'lat']), inplace=True)
df2.sort_values(['long', 'lat']), inplace=True)
for irow in df1.itertuples():
count = 0
indexLst = []
Location1 = (irow[1], irow[2])
for jrow in df2.itertuples():
Location2 = (jrow[1], jrow[2])
if gpd.distance(Location1, Location2).kilometers < 1:
count += 1
indexLst.append(jrow[0])
if count > 0: #only update DF if a match is found
df1.at[irow[0],'count'] = (count)
df2.drop(indexLst, inplace=True) #drop rows already counted from df2 to speed up next iteration
#save updated df1 to new csv file
outFileName = 'combined.csv'
df1.to_csv(outFileName, sep=',', index=False)
Each point in df2 need only be counted once since points in df1 are evenly spaced. To that end I added a drop statment to remove rows from df2 once they have been counted in hopes of improving iteration time. I also tried creating a merge/join statement initially, instead of the nested loops, but was unsuccessful.
At this stage, any help on improving efficiency here is greatly appreciated!
Edit: Goal is to update the 'count' column in df1 (as shown below) with the number of points from df2 which are <1km, and output to a new file.
df1 = lat long valA count
0 123.456 986.54 1 3
1 223.456 886.54 2 1
2 323.456 786.54 3 9
3 423.456 686.54 2 2
4 523.456 586.54 1 5