I have some text data that was parsed messily, so names end up mixed in with the actual data. Is there a package or library that helps identify whether a word is a name? (In this case, I'm assuming US/Western/Euro-centric names.)

Otherwise, what would be a good way to flag names? Maybe train a model on a corpus of names and classify each word in the dataset? I'm not sure what the best approach is, what kind of model would be suited, or whether a solution already exists.

ocean800

1 Answer

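import nltk
# Named StanfordNERTagger in NLTK 3.x; older releases exposed it as NERTagger
from nltk.tag.stanford import StanfordNERTagger

# Paths to the Stanford NER model and jar (requires Java)
st = StanfordNERTagger('stanford-ner/all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
text = """YOUR TEXT GOES HERE"""

for sent in nltk.sent_tokenize(text):
    tokens = nltk.tokenize.word_tokenize(sent)
    tags = st.tag(tokens)  # list of (token, label) pairs
    for tag in tags:
        # Print only tokens labelled as a person's name
        if tag[1] == 'PERSON':
            print(tag)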

Via the question "Improving the extraction of human names with nltk".
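The tagger labels one token at a time, so a full name like "Barack Obama" comes back as two separate PERSON tokens. If you want whole names, one option is to merge consecutive PERSON tokens. This is only a sketch and not part of the linked question: `person_names` is a hypothetical helper, and `sample` is made-up data in the same (token, label) shape that `st.tag(tokens)` returns.

```python
from itertools import groupby


def person_names(tagged_tokens, label="PERSON"):
    """Join runs of consecutive tokens tagged `label` into full-name strings."""
    names = []
    # groupby splits the sequence into runs that share the same True/False key
    for is_person, group in groupby(tagged_tokens, key=lambda pair: pair[1] == label):
        if is_person:
            names.append(" ".join(token for token, _ in group))
    return names


# Hypothetical tagger output, shaped like the result of st.tag(tokens)
sample = [
    ("Barack", "PERSON"), ("Obama", "PERSON"), ("visited", "O"),
    ("Paris", "LOCATION"), ("with", "O"), ("Michelle", "PERSON"), (".", "O"),
]
print(person_names(sample))  # ['Barack Obama', 'Michelle']
```

Once you have the names as strings, you can strip them from your data or flag the rows they appear in. Note that two different people mentioned back to back with nothing in between would be merged into one string, so check the results on your own data.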

zbush548