
I am running a Levenshtein distance comparison on 50k records, and I need to compare each record against every other record. Is there a way to optimize the following code so that it runs faster? The data is stored in a pandas DataFrame.

import pandas as pd
import numpy as np
import Levenshtein

df_s_sorted = df.sort_values(['nonascii_2', 'birth_date'])
df_similarity = pd.DataFrame()

q = 0
for index, p in df_s_sorted.iterrows():
    q = q + 1
    print(q)  # progress counter
    for index1, p1 in df_s_sorted.iterrows():
        # candidate pair: same birth_date but a different name
        if (p["birth_date"] == p1["birth_date"]) and (p["name"] != p1["name"]):
            if Levenshtein.distance(p["name"], p1["name"]) == 1:
                df_similarity = df_similarity.append(p)
                print(p)
    # drop the row that was just processed so it is not compared again
    df_s_sorted.drop(index, inplace=True)
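
For context, the only pairs that can ever match are rows that share the same birth_date, so one idea I have been toying with (a rough sketch only, not yet tested on the full 50k rows; find_similar is just an illustrative name) is to group by birth_date first and only compare pairs inside each group:

import itertools

import pandas as pd
import Levenshtein

def find_similar(df):
    """Collect rows whose name is at Levenshtein distance 1 from
    another name sharing the same birth_date (sketch, untested at scale)."""
    matched = set()
    # Only rows with the same birth_date can match, so restrict the
    # pairwise comparison to each birth_date group instead of all rows.
    for _, group in df.groupby("birth_date"):
        rows = list(zip(group.index, group.to_dict("records")))
        for (i, a), (j, b) in itertools.combinations(rows, 2):
            if a["name"] != b["name"] and Levenshtein.distance(a["name"], b["name"]) == 1:
                matched.update([i, j])
    return df.loc[list(matched)]

df_similarity = find_similar(df_s_sorted)

Grouping should cut the number of comparisons down a lot when the birth dates are spread out, but I am not sure it is the best approach, or whether there is a more vectorized / pandas-native way to do this.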