-2

I would like to do something like an excel fuzzy v-lookup, but in python. I have a list of ~10,000 concatenated strings like:

JohnSmith5159LosAngeles
JaneDo7729NewYork
etc

that I would like to look up similar strings among a list formatted the same way (and I would like to specify the % match), and put the match in the adjacent column. Which would hopefully return something like:

JohnSmith5159LosAngeles |  JohnDSmith5159LosAngelez
JaneDo7729NewYork       |  JaneDoe7719NewYork
etc            

Does anyone know if there's a simple known script or way to do this? pleeassse :)

1 Answers1

1

There's the built-in difflib which has the capability of returning a ratio of string similarity.

Then I wrote a function to return the string of max similarity, as long as that similarity ratio is greater than or equal to 0.9 between the two strings.

There is a small shortcut to check if the first letter of the two strings match. If not, the calculation is not run. This idea is from here: https://stackoverflow.com/a/20478493/42346

And I presume you're working in pandas so I added an example of how to populate a pandas DataFrame with the results.

from difflib import SequenceMatcher

def similar(a, b):
    """ https://stackoverflow.com/a/17388505/42346 """
    return SequenceMatcher(None, a, b).ratio()

def max_similar_string(s, l):
    first_letter_match = [i for i in l if s[0] == i[0]]
    if len(first_letter_match):
        max_ratio = 0
        max_ratio_string = ''
        for i in l:
            ratio = similar(i,s)
            if ratio > max_ratio:
                max_ratio = ratio
                max_ratio_string = i
        if max_ratio >= 0.9:
            return (max_ratio_string,max_ratio)
    else:
        return (np.nan,np.nan)

l = ['JohnDSmith5159LosAngelez','JaneDoe7719NewYork']

df = pd.DataFrame({'s':['JohnSmith5159LosAngeles','JaneDo7729NewYork']}) 

for idx in df.index: 
    df.loc[idx,'similar'], df.loc[idx,'ratio'] = max_similar_string(df.loc[idx,'s'],l) 

Result:

                         s                   similar     ratio
0  JohnSmith5159LosAngeles  JohnDSmith5159LosAngelez  0.936170
1        JaneDo7729NewYork        JaneDoe7719NewYork  0.914286
mechanical_meat
  • 163,903
  • 24
  • 228
  • 223