I am running levenstein comparison on 50k records. I need to compare each record between each other. Is there a way how to optimize the following code to run it faster? The data is stored in pandas dataframe.
import pandas as pd
import numpy as np
import Levenshtein
df_s_sorted = df.sort_values(['nonascii_2', 'birth_date'])
df_similarity = pd.DataFrame()
q=0
for index, p in df_s_sorted.iterrows():
q = q + 1
print(q)
for index, p1 in df_s_sorted.iterrows():
if ((p["birth_date"] == p1["birth_date"]) & (p["name"] != p1["name"] )):
if (Levenshtein.distance(p["name"],p1["name"]) == 1):
df_similarity = df_similarity.append(p)
print(p)
df_s_sorted.drop(index, inplace=True)