
I am running a Levenshtein distance comparison on 50k records, and I need to compare each record against every other record. Is there a way to optimize the following code so that it runs faster? The data is stored in a pandas DataFrame.

import pandas as pd
import numpy as np
import Levenshtein

df_s_sorted = df.sort_values(['nonascii_2', 'birth_date'])
df_similarity = pd.DataFrame()

q = 0
for index, p in df_s_sorted.iterrows():
    q = q + 1
    print(q)  # progress counter
    for index1, p1 in df_s_sorted.iterrows():
        # candidate pair: same birth_date but a different name
        if (p["birth_date"] == p1["birth_date"]) and (p["name"] != p1["name"]):
            if Levenshtein.distance(p["name"], p1["name"]) == 1:
                df_similarity = df_similarity.append(p)
                print(p)
    # drop the row that was just processed so it is not compared again
    df_s_sorted.drop(index, inplace=True)
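
For context, the only pairs that can ever match are rows that share the same birth_date, so one idea I have been toying with (a rough sketch only, not yet tested on the full 50k rows; find_similar is just an illustrative name) is to group by birth_date first and only compare pairs inside each group:

import itertools

import pandas as pd
import Levenshtein

def find_similar(df):
    """Collect rows whose name is at Levenshtein distance 1 from
    another name sharing the same birth_date (sketch, untested at scale)."""
    matched = set()
    # Only rows with the same birth_date can match, so restrict the
    # pairwise comparison to each birth_date group instead of all rows.
    for _, group in df.groupby("birth_date"):
        rows = list(zip(group.index, group.to_dict("records")))
        for (i, a), (j, b) in itertools.combinations(rows, 2):
            if a["name"] != b["name"] and Levenshtein.distance(a["name"], b["name"]) == 1:
                matched.update([i, j])
    return df.loc[list(matched)]

df_similarity = find_similar(df_s_sorted)

Grouping should cut the number of comparisons down a lot when the birth dates are spread out, but I am not sure it is the best approach, or whether there is a more vectorized / pandas-native way to do this.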