Data matching in python (fuzzy, levenstein?)

Question

I would like to do something like an excel fuzzy v-lookup, but in python. I have a list of ~10,000 concatenated strings like:

JohnSmith5159LosAngeles
JaneDo7729NewYork
etc

that I would like to look up similar strings among a list formatted the same way (and I would like to specify the % match), and put the match in the adjacent column. Which would hopefully return something like:

JohnSmith5159LosAngeles |  JohnDSmith5159LosAngelez
JaneDo7729NewYork       |  JaneDoe7719NewYork
etc

Does anyone know if there's a simple known script or way to do this? pleeassse :)

Look up the jellyfish library. It has a bunch of string similarity metrics. — Prateek Dewan, May 25 '20 at 17:14
I wrote a function in excel but its too slow and I don't know how to turn it in to python :/ @ScottHunter — Henrietta Shalansky, May 25 '20 at 17:15

mechanical_meat · Answer 1 · 2020-05-25T18:28:05.773

There's the built-in difflib which has the capability of returning a ratio of string similarity.

Then I wrote a function to return the string of max similarity, as long as that similarity ratio is greater than or equal to 0.9 between the two strings.

There is a small shortcut to check if the first letter of the two strings match. If not, the calculation is not run. This idea is from here: https://stackoverflow.com/a/20478493/42346

And I presume you're working in pandas so I added an example of how to populate a pandas DataFrame with the results.

from difflib import SequenceMatcher

def similar(a, b):
    """ https://stackoverflow.com/a/17388505/42346 """
    return SequenceMatcher(None, a, b).ratio()

def max_similar_string(s, l):
    first_letter_match = [i for i in l if s[0] == i[0]]
    if len(first_letter_match):
        max_ratio = 0
        max_ratio_string = ''
        for i in l:
            ratio = similar(i,s)
            if ratio > max_ratio:
                max_ratio = ratio
                max_ratio_string = i
        if max_ratio >= 0.9:
            return (max_ratio_string,max_ratio)
    else:
        return (np.nan,np.nan)

l = ['JohnDSmith5159LosAngelez','JaneDoe7719NewYork']

df = pd.DataFrame({'s':['JohnSmith5159LosAngeles','JaneDo7729NewYork']}) 

for idx in df.index: 
    df.loc[idx,'similar'], df.loc[idx,'ratio'] = max_similar_string(df.loc[idx,'s'],l)

Result:

                         s                   similar     ratio
0  JohnSmith5159LosAngeles  JohnDSmith5159LosAngelez  0.936170
1        JaneDo7729NewYork        JaneDoe7719NewYork  0.914286

@HenriettaShalansky: you're very welcome! Happy coding to you. — mechanical_meat, May 27 '20 at 04:34

Data matching in python (fuzzy, levenstein?)

1 Answers1