There's the built-in difflib which has the capability of returning a ratio of string similarity.
Then I wrote a function to return the string of max similarity, as long as that similarity ratio is greater than or equal to 0.9 between the two strings.
There is a small shortcut to check if the first letter of the two strings match. If not, the calculation is not run. This idea is from here: https://stackoverflow.com/a/20478493/42346
And I presume you're working in pandas so I added an example of how to populate a pandas DataFrame with the results.
from difflib import SequenceMatcher
def similar(a, b):
""" https://stackoverflow.com/a/17388505/42346 """
return SequenceMatcher(None, a, b).ratio()
def max_similar_string(s, l):
first_letter_match = [i for i in l if s[0] == i[0]]
if len(first_letter_match):
max_ratio = 0
max_ratio_string = ''
for i in l:
ratio = similar(i,s)
if ratio > max_ratio:
max_ratio = ratio
max_ratio_string = i
if max_ratio >= 0.9:
return (max_ratio_string,max_ratio)
else:
return (np.nan,np.nan)
l = ['JohnDSmith5159LosAngelez','JaneDoe7719NewYork']
df = pd.DataFrame({'s':['JohnSmith5159LosAngeles','JaneDo7729NewYork']})
for idx in df.index:
df.loc[idx,'similar'], df.loc[idx,'ratio'] = max_similar_string(df.loc[idx,'s'],l)
Result:
s similar ratio
0 JohnSmith5159LosAngeles JohnDSmith5159LosAngelez 0.936170
1 JaneDo7729NewYork JaneDoe7719NewYork 0.914286