How to recognize if a string is a human name?

Question

I have some text data that was parsed messily, so names end up mixed in with the actual data. Is there a package or library that helps identify whether a word is a name? (In this case, I would be assuming US/Western/Euro-centric names.)

Otherwise, what would be a good way to flag this? Maybe train a model on a corpus of names and assign each word in the dataset a classification? I'm just not sure what the best approach is, what kind of model would be suited, or whether a solution already exists.

You could create a text file with a list of every name and loop through your data? Not efficient but still... — Evorage, Sep 28 '20 at 20:54
So are Paris, Hilton, and Brooklyn names, brands, or places? Hoover, Bear, ... Good luck with this. — DisappointedByUnaccountableMod, Sep 28 '20 at 20:55
@barny good point, but I'm just looking for an overall improvement in my dataset. I don't need perfect results in this case, so examples like that would be within an acceptable margin of error. — ocean800, Sep 28 '20 at 20:57
Does this answer your question? [Improving the extraction of human names with nltk](https://stackoverflow.com/questions/20290870/improving-the-extraction-of-human-names-with-nltk) — bsplosion, Jul 19 '21 at 16:51
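The name-list idea from the first comment can be sketched roughly as below. This is only an illustration: `KNOWN_NAMES` and `flag_names` are hypothetical names, and the six-entry set stands in for a real list that would be loaded from a file or from a names corpus (NLTK ships one as `nltk.corpus.names`, for example). Using a `set` keeps each lookup constant-time, which addresses the "not efficient" worry.

```python
import re

# Hypothetical, deliberately tiny name list; a real one would be loaded
# from a file (one name per line) or from a names corpus.
KNOWN_NAMES = {"john", "mary", "paris", "brooklyn", "james", "linda"}


def flag_names(text, known_names=KNOWN_NAMES):
    """Return the words in `text` that appear in `known_names` (case-insensitive)."""
    # Split on anything that is not a letter or apostrophe.
    words = re.findall(r"[A-Za-z']+", text)
    return [w for w in words if w.lower() in known_names]


flagged = flag_names("Invoice 42 approved by Mary and John in Paris")
print(flagged)  # ['Mary', 'John', 'Paris']
```

Note that "Paris" is flagged even though it is a place here, which is exactly the ambiguity raised in the comments. A plain lookup cannot use context, so it only suits cases where some false positives are acceptable.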

score 5 · Accepted Answer · answered Sep 28 '20 at 20:55

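import nltk
# The class was called NERTagger in older NLTK releases; NLTK 3.x names it StanfordNERTagger.
from nltk.tag.stanford import StanfordNERTagger

# Paths to the downloaded Stanford NER model and jar (Java is required); adjust to your setup.
st = StanfordNERTagger('stanford-ner/all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
text = """YOUR TEXT GOES HERE"""

# Tokenizing needs NLTK's punkt data, e.g. nltk.download('punkt').
for sent in nltk.sent_tokenize(text):
    tokens = nltk.tokenize.word_tokenize(sent)
    tags = st.tag(tokens)  # list of (token, label) pairs
    for tag in tags:
        # Print only the tokens labelled as person names.
        if tag[1] == 'PERSON':
            print(tag)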

via [Improving the extraction of human names with nltk](https://stackoverflow.com/questions/20290870/improving-the-extraction-of-human-names-with-nltk)
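Because the tagger labels one token at a time, a name like "John Smith" comes back as two separate `PERSON` tuples. A possible follow-up step, sketched here under that assumption, is to merge consecutive `PERSON` tokens into one string. The helper `group_person_tokens` and the sample data are hypothetical; the sample is hand-written in the `(token, label)` shape and is not real tagger output.

```python
from itertools import groupby


def group_person_tokens(tags):
    """Join runs of consecutive PERSON-tagged tokens into full names.

    `tags` is a list of (token, label) pairs, the shape returned by the
    tagger's tag() method.
    """
    names = []
    # groupby yields one run per stretch of identical adjacent labels.
    for label, run in groupby(tags, key=lambda pair: pair[1]):
        if label == "PERSON":
            names.append(" ".join(token for token, _ in run))
    return names


# Hand-written sample in the tagger's output shape (not real tagger output).
sample_tags = [
    ("Order", "O"), ("placed", "O"), ("by", "O"),
    ("John", "PERSON"), ("Smith", "PERSON"),
    ("and", "O"), ("Mary", "PERSON"), (".", "O"),
]
names = group_person_tokens(sample_tags)
print(names)  # ['John Smith', 'Mary']
```

One caveat: two different people listed back to back with no token between them would be merged into a single name, so this is best treated as a convenience step and not as a guarantee.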

answered by zbush548

Oh I didn't think of using `NER` for this, might be the way to go, thank you – ocean800 Sep 28 '20 at 21:03
