I have two dataframes (the first about 30K rows, the second about 60M rows), and I need to compare the home addresses between them and pick the best match for each:
import pandas as pd

df1 = pd.DataFrame(data={
    'ID': [1, 2],
    'address': ['14985 Jesses Trl Kenai AK', '589 Silver Rock Trl Castle Rock CO']})
df2 = pd.DataFrame(data={
    'ID': [1, 2],
    'address': ['14985 Jesses Trl Ninilchik AK', '589 Silver Rock Trail Castle Rock CO']})
I solved this with the following script:
import numpy as np
from fuzzywuzzy import fuzz

scores = []
indices = []
df1['adr_of_best_match'] = ''  # will hold the best-matching address from df2
df1['ID_of_best_match'] = ''   # will hold the ID of that best match

for i in range(len(df1.index)):
    score = []
    for j in range(len(df2.index)):
        # fuzzy similarity of the two address strings, scaled to 0..1
        f = fuzz.token_set_ratio(df1.address.values[i], df2.address.values[j]) / 100
        score.append(f)
    ind = np.argmax(score)
    indices.append(ind)
    scores.append(np.max(score))
    df1['adr_of_best_match'].values[i] = df2.address.values[ind]
    df1['ID_of_best_match'].values[i] = df2.ID.values[ind]  # was df2.address by mistake

df1['scores'] = scores  # attach the best score to the final dataframe
Unfortunately, this script takes about an hour to match a single address from df1 against the 60M rows of df2, which is unacceptably slow: it is a full pairwise scan in pure Python, and extrapolated to all 30K rows of df1 it would take years.
How could I speed up this process?
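For reference, here is a minimal sketch of one direction I am considering (untested at the full 60M scale), assuming the rapidfuzz package (pip install rapidfuzz), whose scorers are API-compatible with fuzzywuzzy but implemented in C++, so the inner 60M-row scan runs in compiled code instead of a Python loop:

from rapidfuzz import fuzz, process

# starting from the raw df1/df2 above, before any match columns are added
choices = df2['address'].to_numpy()

def best_match(addr):
    # extractOne scans all of choices in compiled code and returns
    # a (matched_string, score, index_into_choices) tuple
    match, score, idx = process.extractOne(addr, choices, scorer=fuzz.token_set_ratio)
    return pd.Series({'adr_of_best_match': match,
                      'ID_of_best_match': df2['ID'].values[idx],
                      'scores': score / 100})  # rapidfuzz scores are 0..100

df1 = df1.join(df1['address'].apply(best_match))

But even then I am not sure a full pairwise comparison is feasible at this scale.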