I want to build a function that compares rows between two columns. The goal is to find strings in one column that are similar to strings in the other and replace them. These strings may contain errors (unwanted letters or spaces). To tackle this, I wrote something like this:
from difflib import SequenceMatcher
import pandas as pd
import numpy as np

def similarity(column1: str, column2: str, dataframe1, dataframe2):
    series1 = dataframe1[column1]
    series2 = dataframe2[column2]
    results = []
    # Compare every value in series1 against every value in series2
    for check in series1:
        for row in series2:
            rating = SequenceMatcher(None, check, row).ratio()
            if rating >= 0.7:
                results.append([check, row, rating])
                # Replace every occurrence of `check` with the similar value
                series1 = series1.replace([check], row)
    return series1
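To show what the 0.7 threshold means in practice, here is a small check of `SequenceMatcher.ratio()` on a few made-up strings. The sample pairs and the helper name `ratio` are hypothetical and only for illustration; they are not taken from my real data.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.7  # same cutoff as in the function above


def ratio(a: str, b: str) -> float:
    """Similarity score between 0.0 and 1.0, as computed by difflib."""
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical sample pairs: a missing letter, a trailing space, unrelated text
pairs = [("aple", "apple"), ("banana ", "banana"), ("zzz", "apple")]
for a, b in pairs:
    score = ratio(a, b)
    print(f"{a!r} vs {b!r}: {score:.3f} match={score >= THRESHOLD}")
```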
It runs, but it has been running for 30 minutes so far! I do not actually know yet whether it will return the expected result. Is there a better way to do this than a linear search with nested for loops? I am running it on a DataFrame with 171284 rows.
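One direction I am considering, as a rough sketch only: compare each distinct value once, build a lookup dictionary, and apply it with `Series.map` instead of calling `Series.replace` inside the loop. The names `build_mapping` and `similarity_fast` are hypothetical. The sketch assumes all values are strings (missing values are skipped), and it keeps only the single best match per value via `difflib.get_close_matches`, which is not exactly what my loop does. It is still quadratic in the number of distinct values, so it only helps if the columns contain many duplicates.

```python
from difflib import get_close_matches

import pandas as pd


def build_mapping(values: pd.Series, candidates: pd.Series, cutoff: float = 0.7) -> dict:
    """Map each distinct value to its closest candidate that reaches the cutoff."""
    unique_candidates = list(candidates.dropna().unique())
    mapping = {}
    for value in values.dropna().unique():
        # get_close_matches scores candidates with SequenceMatcher and returns
        # the best ones first; n=1 keeps only the closest match.
        best = get_close_matches(value, unique_candidates, n=1, cutoff=cutoff)
        if best:
            mapping[value] = best[0]
    return mapping


def similarity_fast(column1: str, column2: str, dataframe1, dataframe2, cutoff: float = 0.7):
    series1 = dataframe1[column1]
    mapping = build_mapping(series1, dataframe2[column2], cutoff)
    # Values without a close match map to NaN, so restore the originals.
    return series1.map(mapping).fillna(series1)


# Hypothetical toy data, not my real DataFrame
df1 = pd.DataFrame({"a": ["aple", "banana ", "cherry", "zzz"]})
df2 = pd.DataFrame({"b": ["apple", "banana", "cherry"]})
print(similarity_fast("a", "b", df1, df2).tolist())
```

If the number of distinct values is still large, a dedicated fuzzy-matching library or a blocking step (for example, only comparing strings that share a first letter) might be needed, but I have not tested either.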
Thank you!