
Intro

Hello, I'm working on a project that requires replacing dictionary keys found in a pandas column of text with their values, while allowing for potential misspellings. Specifically, I am matching names within a pandas column of text and replacing them with "First Name". For example, I would replace "tommy" with "First Name".

However, misspelled names within the column of strings won't be replaced by my dictionary. For example, "tommmy" has an extra "m" and is not a first name in my dictionary.

import pandas as pd

# Create df
d = {'message' : pd.Series(['awesome', 'my name is tommmy , please help with...', 'hi tommy , we understand your quest...'])}
names = ["tommy", "zelda", "marcon"]

#create dict 
namesdict = {r'(^|\s){}($|\s)'.format(el): r'\1FirstName\2' for el in names}

#replace 
d['message'].replace(namesdict, regex = True)



# output
Out:
0                                       awesome
1    my name is tommmy , please help with...
2    hi FirstName , we understand your quest...
dtype: object

So "tommmy" doesn't match "tommy" in the dictionary, and I need to deal with misspellings. I thought about handling this before the actual dictionary key and value replacement: scan through the pandas data frame and replace the misspelled words within the column of strings ("message") with the correctly spelled name. I've seen a similar approach that uses an index on specific strings,

but how do you match and replace words within the sentences of a pandas df, using a list of correct spellings? Can I do this within the Series replace argument? Should I stick with a regex string replace?

Any suggestions appreciated.

Update, trying Yannis's answer

I'm trying Yannis's answer, but I need to use a list from an outside source for matching, specifically the US Census list of first names. However, it's not matching on whole names when I use the list I download.

d = {'message' : pd.Series(['awesome', 'my name is tommy , please help with...', 'hi tommy , we understand your quest...'])}

import re
import requests
r = requests.get('http://deron.meranda.us/data/census-derived-all-first.txt')

#US Census first names (5000 +) 
firstnamelist = re.findall(r'\n(.*?)\s', r.text, re.DOTALL)


#turn list to string, force lower case
fnstring = ', '.join('"{0}"'.format(w) for w in firstnamelist )
fnstring  = ','.join(firstnamelist)
fnstring  = (fnstring.lower())


##turn to list, prepare it so it matches the name preceded by either the beginning of the string or whitespace.  
names = [x.strip() for x in fnstring.split(',')]




#import jellyfish 
import difflib 


def best_match(tokens, names):
    for i,t in enumerate(tokens):
        closest = difflib.get_close_matches(t, names, n=1)
        if len(closest) > 0:
            return i, closest[0]
    return None

def fuzzy_replace(x, y):
    
    names = y # just a simple replacement list
    tokens = x.split()
    res = best_match(tokens, y)
    if res is not None:
        pos, replacement = res
        tokens[pos] = "FirstName"
        return u" ".join(tokens)
    return x

d["message"].apply(lambda x: fuzzy_replace(x, names))

Results in:

Out: 
0                                        FirstName
1    FirstName name is tommy , please help with...
2    FirstName tommy , we understand your quest...

But if I use a smaller list like this it works:

names = ["tommy", "caitlyn", "kat", "al", "hope"]
d["message"].apply(lambda x: fuzzy_replace(x, names))

Is it something with the longer list of names that's causing a problem?

Peachazoid

1 Answer


Edit:

Changed my solution to use difflib. The core idea is to tokenize your input text and match each token against a list of names. If best_match finds a match, it returns the position (and the best-matching string), so you can replace the token with "FirstName" or anything else you want. Note that best_match stops at the first token that has a close match, so at most one token per message is replaced. See the complete example below:

import pandas as pd
import difflib

df = pd.DataFrame(data=[(0,"my name is tommmy , please help with"), (1, "hi FirstName , we understand your quest")], columns=["A", "message"])

def best_match(tokens, names):
    for i,t in enumerate(tokens):
        closest = difflib.get_close_matches(t, names, n=1)
        if len(closest) > 0:
            return i, closest[0]
    return None

def fuzzy_replace(x):
    names = ["tommy", "john"] # just a simple replacement list
    tokens = x.split()
    res = best_match(tokens, names)
    if res is not None:
        pos, replacement = res
        tokens[pos] = "FirstName"
        return u" ".join(tokens)
    return x

df.message.apply(lambda x: fuzzy_replace(x))

And the output you should get is the following:

0    my name is FirstName , please help with
1    hi FirstName , we understand your quest
Name: message, dtype: object
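
A note on tuning, offered as a minimal sketch rather than as part of the solution above: difflib.get_close_matches takes a cutoff argument (default 0.6), and short tokens can clear that default once the name list contains short names. The helper name close_name, the min_len guard, the 0.85 threshold, and the tiny name list are all assumptions chosen for illustration.

```python
import difflib

# Hypothetical mini list; "amy" stands in for the short names found in a large list.
names = ["tommy", "amy", "kim"]

def close_name(token, names, cutoff=0.85, min_len=3):
    """Return the closest name for token, or None if nothing is close enough.

    cutoff and min_len are illustrative values, not recommendations.
    """
    token = token.lower()
    if len(token) < min_len:
        return None  # skip very short words such as "my" or "hi"
    matches = difflib.get_close_matches(token, names, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# With the default cutoff of 0.6, the two-letter word "my" is matched to "amy".
print(difflib.get_close_matches("my", names, n=1))  # ['amy']

# With the stricter settings, "my" is rejected while the misspelling still matches.
print(close_name("my", names))      # None
print(close_name("tommmy", names))  # tommy
```

The right threshold depends on the data, so it is worth checking a sample of messages before settling on one.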

Edit 2

After the discussion, I decided to have another go, using NLTK for part-of-speech tagging and running the fuzzy matching against the name list only for tokens tagged NNP (proper nouns). The problem is that the tagger sometimes gets the tag wrong, e.g. "Hi" might also be tagged as a proper noun. However, if the list of names is lowercased, then get_close_matches doesn't match "Hi" against a name but still matches all the other names. I recommend not lowercasing df["message"], to increase the chances that NLTK tags the names properly. One can also experiment with StanfordNER, but nothing will work 100% of the time. Here is the code:

import pandas as pd
import difflib
from nltk import pos_tag, wordpunct_tokenize
import requests 
import re

r = requests.get('http://deron.meranda.us/data/census-derived-all-first.txt')

# US Census first names (5000 +) 
firstnamelist = re.findall(r'\n(.*?)\s', r.text, re.DOTALL)

# turn list to string, force lower case
# simplified things here
names = [w.lower() for w in firstnamelist]


df = pd.DataFrame(data=[(0,"My name is Tommmy, please help with"), 
                        (1, "Hi Tommy , we understand your question"),
                        (2, "I don't talk to Johhn any longer"),
                        (3, 'Michale says this is stupid')
                       ], columns=["A", "message"])

def match_names(token, tag):
    print(token, tag)  # debug output; remove once the tags look right
    if tag == "NNP":
        best_match = difflib.get_close_matches(token, names, n=1)
        if len(best_match) > 0:
            return "FirstName" # or best_match[0] if you want to return the name found
        else:
            return token
    else:
        return token

def fuzzy_replace(x):
    tokens = wordpunct_tokenize(x)
    pos_tokens = pos_tag(tokens)
    # Every token is a tuple (token, tag)
    result = [match_names(token, tag) for token, tag in pos_tokens]
    x = u" ".join(result)
    return x

df['message'].apply(lambda x: fuzzy_replace(x))

And I get the following output:

0       My name is FirstName , please help with
1    Hi FirstName , we understand your question
2        I don ' t talk to FirstName any longer
3                 FirstName says this is stupid
Name: message, dtype: object
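
The output above shows "don't" turned into "don ' t", because the tokens from wordpunct_tokenize are rejoined with spaces. If preserving the original spacing and punctuation matters, one possible variation is to substitute matches in place with re.sub and a callback. The sketch below is not the NLTK approach itself: it assumes a hypothetical short name list, a hypothetical cutoff of 0.8, and a simple capitalization check as a crude stand-in for the NNP tag.

```python
import difflib
import re

# Hypothetical short list; in practice this would be the lowercased census list.
names = ["tommy", "john", "michael"]

def replace_word(match, names, cutoff=0.8):
    """Return "FirstName" if the matched word is close to a known name."""
    word = match.group(0)
    # Assumption: only capitalized words are name candidates
    # (a crude substitute for checking the NNP tag).
    if not word[0].isupper():
        return word
    close = difflib.get_close_matches(word.lower(), names, n=1, cutoff=cutoff)
    return "FirstName" if close else word

def fuzzy_replace_in_place(text, names):
    # Only runs of letters are rewritten; spacing and punctuation stay untouched.
    return re.sub(r"[A-Za-z]+", lambda m: replace_word(m, names), text)

print(fuzzy_replace_in_place("I don't talk to Johhn any longer", names))
# I don't talk to FirstName any longer
```

It can be applied to the column in the same way as before, e.g. df['message'].apply(lambda x: fuzzy_replace_in_place(x, names)).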
Yannis P.
  • I'm actually trying to replace the misspelled names without explicitly stating how they're misspelled. Meaning, I should not have to include repl = {"tommmy":"tommy"} in the function you applied. But I do agree an apply across the column sentences would work. Thanks for the tip – Peachazoid Jul 27 '17 at 18:53
  • then probably inside the `fuzzy_replace` you can call `Jaro-Winkler` as used in the linked post. – Yannis P. Jul 27 '17 at 18:55
  • Check my edited answer. I have devised a strategy that works for this, although it might be a bit slow for many names. I think you can adapt it to your problem – Yannis P. Jul 27 '17 at 19:50
  • Hi Yannis, your answer is definitely helping. But I need to use a longer list with names from the US Census (5k names). Can you check my update above if you have any advice? – Peachazoid Jul 27 '17 at 22:20
  • Apparently `hi` and `my` are taken as first names, so the solution has to be reworked, but I'll come back to it – Yannis P. Jul 28 '17 at 08:30
  • What is the main reason to lowercase our input strings? – Yannis P. Jul 28 '17 at 09:28
  • I meant Series `message`. If not, then it might be a good idea to [tag tokens](http://www.nltk.org/book/ch05.html) with their parts of speech using NLTK and run the matching against the names only when the tagged token is a proper noun (tag `NNP`). If the messages are lowercased before being passed to NLTK, the names are not recognized as easily. This again will have some flaws; we cannot be 100% accurate – Yannis P. Jul 28 '17 at 10:30
  • Sorry, can you expand on what you mean? By "meant Series message", do you mean turning message into a Series? Also, I agree on tagging NNP first, but would it be able to tag misspelled words? If not, I'm still in the same situation where "tommmy" needs to be "tommy". – Peachazoid Jul 28 '17 at 17:25
  • Right so I meant column message in variable d in your code – Yannis P. Jul 28 '17 at 17:33
  • Oh, so you're asking why I lowercase the "message" column? I'm doing it to make it easier to match between the retrieved text and the names I need to download from the census. – Peachazoid Jul 28 '17 at 17:35
  • But I'll take a stab at the NLTK attempt. I may repost once I take a stab at it! But your answer works at current stage, thanks so much – Peachazoid Jul 28 '17 at 17:39
  • I looked at it in the morning and it tags tommmy as NNP, which is good. I bet if you isolate the NNP tags then you can match against the names, in case you want to make sure it is a misspelled name – Yannis P. Jul 28 '17 at 17:47
  • Can you show me how you did that? And how you extracted the NNP tag to match? That would be super helpful. – Peachazoid Jul 28 '17 at 17:57
  • I had a go, check the edited solution. Thanks for accepting the solution, I hope it will work for you. You might need to test for some extreme cases and hardcode or make some hacks – Yannis P. Jul 30 '17 at 15:00
  • Ah ok, this looks good. But now I realize that because it has NNP matching all proper nouns, it could hit and match locations as well. For example, in this code "Cayce" could mean either "Cayce, SC" or "Cayce", a woman's name. I think a better way is to use NLTK's named entity recognition, where if tag == "LOCATION" then replace with "LOCATION" after fuzzy matching, or if tag == "PERSON" then replace with "PERSON", using the NERTagger function from nltk.tag.stanford (https://stackoverflow.com/questions/18371092/stanford-named-entity-recognizer-ner-functionality-with-nltk) – Peachazoid Jul 31 '17 at 21:33
  • Hey Yannis, quick question. I see you define "fnstring" but you are using "names". Is "names" a different list you meant in here? – Peachazoid Aug 01 '17 at 05:44
  • Darn! That's clearly an oversight but since I was working in Jupyter notebook, `names` were already in the namespace – Yannis P. Aug 01 '17 at 13:53
  • Ah ok. Thanks! So here's what I'm thinking. I can do PoS tagging and, if a token is a proper noun, scan the reference list for its spelling, match it to the closest spelling, and replace it. Using PoS is important so we don't treat "Hi" as a named entity. If it matches a first name, then replace it with "First Name" (same with last names, and if it matches a nickname, replace it with "First Name"). For all other proper nouns that are not first names, run them through NER: if "Person" but not a first name or last name, just replace with "Name"; if "Location", then replace with "Location". Is that possible, you think? – Peachazoid Aug 01 '17 at 18:43