I want to build a function that compares rows between two columns. The objective is to find out whether a string in one column is similar to a string in the other and, if so, replace it. These strings might contain errors (unwanted letters or spaces). To tackle this problem, I did something like this:

from difflib import SequenceMatcher
import pandas as pd
import numpy as np


def similarity(column1: str, column2: str, dataframe1, dataframe2):
    series1 = dataframe1[column1]
    series2 = dataframe2[column2]
    results = []
    for check in series1:
        for row in series2:
            rating = SequenceMatcher(None, check, row).ratio()
            if rating >= 0.7:
                results.append([check, row, rating])
                series1 = series1.replace([check], row)
    return series1

Yes, it is working, but it has been running for 30 minutes so far! I actually have no idea whether it will return the expected result. Is there a better way to do this than a linear search with nested for loops? I am using it on a DataFrame with 171284 rows.

Thank You!

  • Would you mind adding some example data alongside how the function is called? This may aid optimizing the code. – Michael Hodel Aug 22 '22 at 23:51
  • You're making 171284 * 171284 = more than 29 billion comparisons. At the very least, if you're not sure it will return the expected result, you should try with a much smaller dataframe and verify that it works as expected. – Kraigolas Aug 22 '22 at 23:57
  • Kraig and @MichaelHodel: it was a string matching problem that could be solved with fuzzy string matching, using the fuzzywuzzy package. Take a look: https://stackoverflow.com/questions/13636848/is-it-possible-to-do-fuzzy-match-merge-with-python-pandas – Matheus Castro Aug 23 '22 at 12:49
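
For anyone who would rather stay with the standard library, here is a minimal sketch of the same idea using `difflib.get_close_matches`. It is not the fuzzywuzzy approach from the linked question. The helper name `build_replacements` and the sample lists are made up for illustration, and it assumes every value is a plain string (no NaN). It scores each distinct value only once and keeps the best match at or above the cutoff, which may not give exactly the same result as the loop in the question.

```python
from difflib import get_close_matches


def build_replacements(values, candidates, cutoff=0.7):
    """Map each distinct value to its closest candidate, or to itself.

    Hypothetical helper: assumes all values and candidates are strings.
    """
    # Drop duplicate candidates while keeping their original order.
    unique_candidates = list(dict.fromkeys(candidates))
    mapping = {}
    # Score each distinct value once instead of once per row.
    for value in dict.fromkeys(values):
        # get_close_matches ranks candidates by SequenceMatcher ratio and
        # uses cheap upper bounds to skip pairs that cannot reach the cutoff.
        best = get_close_matches(value, unique_candidates, n=1, cutoff=cutoff)
        mapping[value] = best[0] if best else value
    return mapping


# Made-up sample data standing in for the two columns.
dirty = ["aple", "banana ", "cherry", "aple"]
clean = ["apple", "banana", "kiwi"]

mapping = build_replacements(dirty, clean)
fixed = [mapping[v] for v in dirty]
print(fixed)  # ['apple', 'banana', 'cherry', 'apple']
```

With pandas, something like `dataframe1[column1].map(build_replacements(dataframe1[column1], dataframe2[column2]))` should apply the mapping in one pass. The work is still proportional to (distinct values in column 1) × (distinct values in column 2), so this only helps when the columns contain many repeated strings.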
