I want to calculate Levenshtein distance for all rows of a single column of a Pandas DataFrame. I am getting MemoryError when I cross-join my DataFrame containing ~115,000 rows. In the end, I want to keep only those rows where Levenshtein distance is either 1 or 2. Is there an optimized way to do the same?
Here's my brute force approach:
import pandas as pd
from textdistance import levenshtein
# from itertools import product
# original df
df = pd.DataFrame({'Name':['John', 'Jon', 'Ron'], 'Phone':[123, 456, 789], 'State':['CA', 'GA', 'MA']})
# create another df containing all rows and a few columns needed for further checks
name = df['Name']
phone = df['Phone']
dic_ = {'Name_Match':name,'Phone_Match':phone}
df_match = pd.DataFrame(dic_, index=range(len(name)))
df['key'] = 1
df_match['key'] = 1
# cross join df containing all columns with another df containing some of its columns
df_merged = pd.merge(df, df_match, on='key').drop("key",1)
# keep only rows where distance = 1 or distance = 2
df_merged['distance'] = df_merged.apply(lambda x: levenshtein.distance(x['Name'], x['Name_Match']), axis=1)
Original DataFrame:
Out[1]:
Name Phone State
0 John 123 CA
1 Jon 456 GA
2 Ron 789 MA
New DataFrame from original DataFrame:
df_match
Out[2]:
Name_Match Phone_Match
0 John 123
1 Jon 456
2 Ron 789
Cross-join:
df_merged
Out[3]:
Name Phone State Name_Match Phone_Match distance
0 John 123 CA John 123 0
1 John 123 CA Jon 456 1
2 John 123 CA Ron 789 2
3 Jon 456 GA John 123 1
4 Jon 456 GA Jon 456 0
5 Jon 456 GA Ron 789 1
6 Ron 789 MA John 123 2
7 Ron 789 MA Jon 456 1
8 Ron 789 MA Ron 789 0
Final output:
df_merged[((df_merged.distance==1)==True) | ((df_merged.distance==2)==True)]
Out[4]:
Name Phone State Name_Match Phone_Match distance
1 John 123 CA Jon 456 1
2 John 123 CA Ron 789 2
3 Jon 456 GA John 123 1
5 Jon 456 GA Ron 789 1
6 Ron 789 MA John 123 2
7 Ron 789 MA Jon 456 1