
I want to calculate Levenshtein distance for all rows of a single column of a Pandas DataFrame. I am getting MemoryError when I cross-join my DataFrame containing ~115,000 rows. In the end, I want to keep only those rows where Levenshtein distance is either 1 or 2. Is there an optimized way to do the same?
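The scale explains the MemoryError: a self cross-join produces n² rows. A rough back-of-the-envelope (the ~50 bytes per row is an assumption, actual pandas overhead is typically higher):

```python
# Rough size of the full cross-join: every row pairs with every row.
n = 115_000
pairs = n * n                  # 13,225,000,000 rows
# Assuming roughly 50 bytes per row (a couple of short strings and ints),
# the materialized cross-join would need hundreds of GB of RAM.
approx_gb = pairs * 50 / 1e9
```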

Here's my brute force approach:

import pandas as pd
from textdistance import levenshtein
# from itertools import product

# original df
df = pd.DataFrame({'Name':['John', 'Jon', 'Ron'], 'Phone':[123, 456, 789], 'State':['CA', 'GA', 'MA']})
# create another df containing all rows and a few columns needed for further checks
name = df['Name']
phone = df['Phone']
dic_ = {'Name_Match':name,'Phone_Match':phone}
df_match = pd.DataFrame(dic_, index=range(len(name)))

df['key'] = 1
df_match['key'] = 1

# cross join df containing all columns with another df containing some of its columns
df_merged = pd.merge(df, df_match, on='key').drop(columns='key')

# compute pairwise Levenshtein distance for every cross-joined pair
df_merged['distance'] = df_merged.apply(lambda x: levenshtein.distance(x['Name'], x['Name_Match']), axis=1)

Original DataFrame:

Out[1]:   
   Name  Phone State  
0  John    123    CA  
1   Jon    456    GA  
2   Ron    789    MA  

New DataFrame from original DataFrame:

df_match
Out[2]: 
  Name_Match  Phone_Match
0       John          123
1        Jon          456
2        Ron          789

Cross-join:

df_merged
Out[3]: 
   Name  Phone State Name_Match  Phone_Match  distance
0  John    123    CA       John          123         0
1  John    123    CA        Jon          456         1
2  John    123    CA        Ron          789         2
3   Jon    456    GA       John          123         1
4   Jon    456    GA        Jon          456         0
5   Jon    456    GA        Ron          789         1
6   Ron    789    MA       John          123         2
7   Ron    789    MA        Jon          456         1
8   Ron    789    MA        Ron          789         0

Final output:

df_merged[(df_merged.distance == 1) | (df_merged.distance == 2)]
Out[4]: 
   Name  Phone State Name_Match  Phone_Match  distance
1  John    123    CA        Jon          456         1
2  John    123    CA        Ron          789         2
3   Jon    456    GA       John          123         1
5   Jon    456    GA        Ron          789         1
6   Ron    789    MA       John          123         2
7   Ron    789    MA        Jon          456         1
Chaitya Shah
    Does this answer your question? [Highly parallelizable Levenstein Distance Algorithm](https://stackoverflow.com/questions/16036140/highly-parallelizable-levenstein-distance-algorithm) – Florian Fasmeyer Mar 16 '21 at 16:36
  • @FlorianFasmeyer the OP has memory issues, your solution is for parallelizing the operation and it's not even for python – Jimmar Mar 16 '21 at 16:58

1 Answer


Your problem is not really about Levenshtein distance: the cross-join itself exhausts your machine's memory (RAM). You can confirm this with Task Manager on Windows, or the top or htop commands on Linux/macOS.

One solution is to split your DataFrame into smaller partitions, run the cross-join and apply on each partition, and drop the rows you don't need BEFORE processing the next partition.
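A minimal sketch of that partitioning idea, using a plain-Python Levenshtein function so the example has no extra dependencies (in your real code you would call textdistance.levenshtein.distance instead; `close_matches` and the chunk size are illustrative names, not a library API):

```python
import pandas as pd

def levenshtein(a, b):
    """Classic dynamic-programming edit distance (stand-in for textdistance)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def close_matches(df, chunk_size=5000):
    """Cross-join df against its Name/Phone columns one chunk at a time,
    keeping only rows with distance 1 or 2 before moving to the next chunk,
    so the full n-squared product never exists in memory at once."""
    df_match = df[['Name', 'Phone']].rename(
        columns={'Name': 'Name_Match', 'Phone': 'Phone_Match'})
    kept = []
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        # same key=1 cross-join trick as in the question, applied per chunk
        merged = (chunk.assign(key=1)
                       .merge(df_match.assign(key=1), on='key')
                       .drop(columns='key'))
        merged['distance'] = merged.apply(
            lambda r: levenshtein(r['Name'], r['Name_Match']), axis=1)
        kept.append(merged[merged['distance'].isin([1, 2])])
    return pd.concat(kept, ignore_index=True)

df = pd.DataFrame({'Name': ['John', 'Jon', 'Ron'],
                   'Phone': [123, 456, 789],
                   'State': ['CA', 'GA', 'MA']})
result = close_matches(df, chunk_size=2)
```

With the question's 3-row example this keeps the same 6 rows as the brute-force filter; with ~115,000 rows only one chunk's worth of the cross-join is alive at any moment. On pandas 1.2+ you could also replace the key=1 trick with `merge(..., how='cross')`.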

If you are running it on the cloud, you can get a machine with more memory instead.

Bonus: I'd suggest parallelizing the apply operation with something like Pandarallel to make it much faster.

Jimmar