Intro
Hello, I'm working on a project that requires me to replace dictionary keys found in a pandas column of text with values, while allowing for potential misspellings. Specifically, I am matching names within a pandas column of text and replacing them with "FirstName". For example, I would be replacing "tommy" with "FirstName".
However, I realize that misspelled names in the column of strings won't be replaced by my dictionary. For example, "tommmy" has an extra m and is not a first name in my dictionary.
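import pandas as pd

# Create the data (a dict holding one Series of messages)
d = {'message': pd.Series(['awesome', 'my name is tommmy , please help with...', 'hi tommy , we understand your quest...'])}
names = ["tommy", "zelda", "marcon"]
# Build a dict of regex pattern -> replacement; each pattern matches a name
# preceded by start-of-string or whitespace and followed by end-of-string or whitespace
namesdict = {r'(^|\s){}($|\s)'.format(el): r'\1FirstName\2' for el in names}
# Replace
d['message'].replace(namesdict, regex=True)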
#output
Out:
0 awesome
1 my name is tommmy , please help with...
2 hi FirstName , we understand your quest...
dtype: object
So "tommmy" doesn't match "tommy" in the dictionary, and I need to deal with misspellings. I thought about handling this before the actual dictionary key and value replacement: scan through the pandas data frame and replace the misspelled words within the column of strings ("message") with the appropriate name. I've seen a similar approach using an index on specific strings like this one
but how do you match and replace words inside the sentences of a pandas df, using a list of correct spellings? Can I do this with the Series replace method? Should I stick with a regex string replace?
Any suggestions appreciated.
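To make the idea concrete, here is a minimal sketch of the "fix misspellings token by token" approach described above. It is only an illustration under some assumptions: it uses the standard library's difflib.get_close_matches, the helper name replace_close_names and the cutoff of 0.8 are hypothetical choices of mine, and it works on plain strings rather than on the Series directly.

```python
import difflib

# Known correct spellings (same sample list as in the question).
NAMES = ["tommy", "zelda", "marcon"]

def replace_close_names(message, names=NAMES, cutoff=0.8, placeholder="FirstName"):
    """Replace every whitespace-separated token that closely matches a name.

    Hypothetical helper: the cutoff of 0.8 is an assumed value, not a tuned one.
    Note that split()/join() collapses runs of whitespace into single spaces.
    """
    out = []
    for token in message.split():
        # get_close_matches returns up to n candidates whose similarity
        # ratio is >= cutoff, best first; an empty list means no match.
        match = difflib.get_close_matches(token.lower(), names, n=1, cutoff=cutoff)
        out.append(placeholder if match else token)
    return " ".join(out)

messages = [
    "awesome",
    "my name is tommmy , please help with...",
    "hi tommy , we understand your quest...",
]
cleaned = [replace_close_names(m) for m in messages]
for line in cleaned:
    print(line)
```

With a pandas Series, the same function could presumably be applied with d['message'].apply(replace_close_names). Checking every token, instead of stopping at the first one, means the misspelled "tommmy" and the exact "tommy" are both handled in one pass.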
Update: trying Yannis's answer
I'm trying Yannis's answer, but I need to use a list from an outside source for matching, specifically the US Census list of first names. With the list I download, though, it's not matching on the whole names.
import re
import difflib
import pandas as pd
import requests

d = {'message': pd.Series(['awesome', 'my name is tommy , please help with...', 'hi tommy , we understand your quest...'])}
r = requests.get('http://deron.meranda.us/data/census-derived-all-first.txt')
# US Census first names (5000+)
firstnamelist = re.findall(r'\n(.*?)\s', r.text, re.DOTALL)
# Join the list into one comma-separated string and force lower case
fnstring = ','.join(firstnamelist)
fnstring = fnstring.lower()
# Turn it back into a list of stripped names
names = [x.strip() for x in fnstring.split(',')]
def best_match(tokens, names):
    for i, t in enumerate(tokens):
        closest = difflib.get_close_matches(t, names, n=1)
        if len(closest) > 0:
            return i, closest[0]
    return None

def fuzzy_replace(x, y):
    names = y # just a simple replacement list
    tokens = x.split()
    res = best_match(tokens, y)
    if res is not None:
        pos, replacement = res
        tokens[pos] = "FirstName"
        return u" ".join(tokens)
    return x

d["message"].apply(lambda x: fuzzy_replace(x, names))
Results in:
Out:
0 FirstName
1 FirstName name is tommy , please help with...
2 FirstName tommy , we understand your quest...
But if I use a smaller list like this it works:
names = ["tommy", "caitlyn", "kat", "al", "hope"]
d["message"].apply(lambda x: fuzzy_replace(x, names))
Is it something with the longer list of names that's causing a problem?
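One possible explanation, offered as a guess rather than a confirmed diagnosis: difflib.get_close_matches uses a default cutoff of 0.6, so with thousands of names even short words such as "my" can be similar enough to some name, and best_match returns at the first token that has any match. The sketch below lists which tokens match which names. The helper matched_tokens is hypothetical, and the two-name list is only a stand-in for the long census list ("amy" is an assumed entry).

```python
import difflib

def matched_tokens(message, names, cutoff=0.6):
    """Return (token, closest_name) for every token that has a close match.

    Hypothetical diagnostic helper; cutoff=0.6 mirrors difflib's default.
    """
    pairs = []
    for token in message.split():
        closest = difflib.get_close_matches(token, names, n=1, cutoff=cutoff)
        if closest:
            pairs.append((token, closest[0]))
    return pairs

# Illustrative stand-in for the long list; "amy" is an assumed entry.
long_list = ["tommy", "amy"]
msg = "my name is tommy"

loose = matched_tokens(msg, long_list)                # default cutoff
strict = matched_tokens(msg, long_list, cutoff=0.85)  # assumed stricter value

print(loose)   # [('my', 'amy'), ('tommy', 'tommy')]
print(strict)  # [('tommy', 'tommy')]
```

With the default cutoff, "my" matches "amy" before "tommy" is ever reached, which would explain why the first word of each message gets replaced. A higher cutoff, or replacing every matching token instead of only the first, may be worth trying; the value 0.85 here is just an example.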