Improving the extraction of human names with nltk

Question

I am trying to extract human names from text.

Does anyone have a method that they would recommend?

This is what I tried (code is below): I am using nltk to find everything marked as a person and then generating a list of all the NNP parts of that person. I am skipping persons where there is only one NNP which avoids grabbing a lone surname.

I am getting decent results but was wondering if there are better ways to go about solving this problem.

Code:

import nltk
from nameparser.parser import HumanName

def get_human_names(text):
    tokens = nltk.tokenize.word_tokenize(text)
    pos = nltk.pos_tag(tokens)
    sentt = nltk.ne_chunk(pos, binary = False)
    person_list = []
    person = []
    name = ""
    # Tree.node was replaced by Tree.label() in NLTK 3
    for subtree in sentt.subtrees(filter=lambda t: t.label() == 'PERSON'):
        for leaf in subtree.leaves():
            person.append(leaf[0])
        if len(person) > 1: #avoid grabbing lone surnames
            for part in person:
                name += part + ' '
            if name[:-1] not in person_list:
                person_list.append(name[:-1])
            name = ''
        person = []

    return (person_list)

text = """
Some economists have responded positively to Bitcoin, including 
Francois R. Velde, senior economist of the Federal Reserve in Chicago 
who described it as "an elegant solution to the problem of creating a 
digital currency." In November 2013 Richard Branson announced that 
Virgin Galactic would accept Bitcoin as payment, saying that he had invested 
in Bitcoin and found it "fascinating how a whole new global currency 
has been created", encouraging others to also invest in Bitcoin.
Other economists commenting on Bitcoin have been critical. 
Economist Paul Krugman has suggested that the structure of the currency 
incentivizes hoarding and that its value derives from the expectation that 
others will accept it as payment. Economist Larry Summers has expressed 
a "wait and see" attitude when it comes to Bitcoin. Nick Colas, a market 
strategist for ConvergEx Group, has remarked on the effect of increasing 
use of Bitcoin and its restricted supply, noting, "When incremental 
adoption meets relatively fixed supply, it should be no surprise that 
prices go up. And that’s exactly what is happening to BTC prices."
"""

names = get_human_names(text)
print("LAST, FIRST")
for name in names:
    parsed = HumanName(name)  # parse once, then read both parts
    print(parsed.last + ', ' + parsed.first)

Output:

LAST, FIRST
Velde, Francois
Branson, Richard
Galactic, Virgin
Krugman, Paul
Summers, Larry
Colas, Nick

Apart from Virgin Galactic, this is all valid output. Of course, knowing that Virgin Galactic isn't a human name in the context of this article is the hard (maybe impossible) part.
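
The collect, skip, and dedupe steps described in the question can be separated from NLTK and checked on their own. The following is only a minimal sketch, assuming each PERSON subtree has already been reduced to a list of its tokens; `join_person_chunks` and `min_tokens` are hypothetical names, not part of NLTK.

```python
def join_person_chunks(chunks, min_tokens=2):
    """Join token lists into names, skipping short chunks and duplicates.

    `chunks` is assumed to be a list of token lists, one per PERSON subtree.
    """
    names = []
    for tokens in chunks:
        if len(tokens) < min_tokens:
            continue  # avoid grabbing lone surnames
        name = " ".join(tokens)
        if name not in names:  # keep first occurrence, preserve order
            names.append(name)
    return names


chunks = [["Paul", "Krugman"], ["Bitcoin"], ["Paul", "Krugman"], ["Nick", "Colas"]]
print(join_person_chunks(chunks))  # → ['Paul Krugman', 'Nick Colas']
```

This does nothing about false positives such as "Virgin Galactic"; it only tidies the bookkeeping.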

While interesting, it isn't clear what the actual question is here. A suggestion to "make my code better" isn't well suited for this site. — Hooked, Nov 29 '13 at 19:15
Thanks, basically my question is: I want to extract names from text. This is what I tried, it works OK, but not fantastically well. Are there any alternatives to solving this problem that anyone would recommend? I'll edit the question to improve it. — e h, Nov 30 '13 at 13:43
thanks for sharing. i was able to use your code, but i ran into two errors needing fixing. first i got the error: `SyntaxError: Non-ASCII character.... no encoding declared` which was fixed by adding on line 1: `# -- coding: UTF-8 -- ` then i got the error: `NotImplementedError("Use label() to access a node label.` which was fixed by removing "node" from line 17 as follows: `for subtree in sentt.subtrees(filter=lambda t: t.label() == 'PERSON'):` — sharon, Jul 25 '15 at 02:04
if you are hoping to use this code today. make sure to place these after import statements. nltk.download('punkt'); nltk.download('averaged_perceptron_tagger'); nltk.download('maxent_ne_chunker'); nltk.download('words'); Addition to that make sure to replace t.node with t.label() — Gihan Gamage, Apr 16 '20 at 17:25

NG_ · Accepted Answer · 2022-04-09T13:36:03.663

34

I must agree with the suggestion that "make my code better" isn't well suited for this site, but I can point you in a direction you can dig into.

Disclaimer: This answer is ~7 years old. Definitely, it needs to be updated to newer Python and NLTK versions. Please, try to do it yourself, and if it works, share your know-how with us.

Take a look at the Stanford Named Entity Recognizer (NER). Its binding has been included in NLTK v2.0, but you must download some core files. Here is a script that can do all of that for you.

I wrote this script:

import nltk
# NERTagger was renamed StanfordNERTagger in later NLTK 3.x releases
from nltk.tag.stanford import StanfordNERTagger

st = StanfordNERTagger('stanford-ner/all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
text = """YOUR TEXT GOES HERE"""

for sent in nltk.sent_tokenize(text):
    tokens = nltk.tokenize.word_tokenize(sent)
    tags = st.tag(tokens)  # list of (token, label) tuples
    for tag in tags:
        if tag[1] == 'PERSON':
            print(tag)

and got reasonably good output:

('Francois', 'PERSON')
('R.', 'PERSON')
('Velde', 'PERSON')
('Richard', 'PERSON')
('Branson', 'PERSON')
('Virgin', 'PERSON')
('Galactic', 'PERSON')
('Bitcoin', 'PERSON')
('Bitcoin', 'PERSON')
('Paul', 'PERSON')
('Krugman', 'PERSON')
('Larry', 'PERSON')
('Summers', 'PERSON')
('Bitcoin', 'PERSON')
('Nick', 'PERSON')
('Colas', 'PERSON')

Hope this is helpful.
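
The tagger emits one tuple per token, so a full name arrives as several separate PERSON entries. One way to stitch them back together is to group consecutive tokens that share a label. This is a rough sketch, assuming `tags` is a list of `(token, label)` pairs like the output above; `merge_person_tokens` is a hypothetical helper name, and two different people mentioned back to back with nothing between them would be merged into one name.

```python
from itertools import groupby


def merge_person_tokens(tags):
    """Join runs of consecutive PERSON-tagged tokens into full names."""
    names = []
    # groupby yields one group per run of identical labels
    for label, group in groupby(tags, key=lambda pair: pair[1]):
        if label == "PERSON":
            names.append(" ".join(token for token, _ in group))
    return names


tags = [
    ("Francois", "PERSON"), ("R.", "PERSON"), ("Velde", "PERSON"),
    (",", "O"), ("senior", "O"),
    ("Richard", "PERSON"), ("Branson", "PERSON"),
]
print(merge_person_tokens(tags))  # → ['Francois R. Velde', 'Richard Branson']
```

Because the run is as long as the tagger makes it, middle names and four-word names stay in one piece.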

edited Apr 09 '22 at 13:36

answered Jun 09 '14 at 11:13

NG_

he wanted the output to be in First name and last name. NER will give only PERSON label. – Rohan Amrute May 19 '16 at 10:28
This solution gives first name and last name separately and not combines together. You will run into issues if there is a middle name. Worse, if you have a name with four words, in that case it will be grouped into 2 names if we just combine 2 consecutive words to find a name. And as such does not answer the question. Thanks! – StatguyUser Nov 28 '16 at 12:20
Does this work for names in different languages too? If not, then how to do this? I am using Indian names. – Himanshu Suthar May 28 '19 at 10:55
This part did not go through: " from nltk.tag.stanford import NERTagger" – tursunWali Mar 24 '21 at 03:53
@tursunWali sorry to hear that. This answer is ~7 years old. Definitely it needs to be updated to newer Python and NLTK versions. – NG_ Mar 25 '21 at 08:56
It's not working we need updated code ? – Aravind R Mar 14 '22 at 10:26
cannot import name 'NERTagger' – Sashko Lykhenko Apr 07 '22 at 16:16
Dear @SashkoLykhenko sorry to hear that. I've added a disclaimer to my answer. Please, try to do it yourself, and if it works, share your know-how with us. Then I can update my answer with your latest findings. – NG_ Apr 09 '22 at 13:37

score 12 · Answer 2 · answered Feb 25 '15 at 20:27

For anyone else looking, I found this article to be useful: http://timmcnamara.co.nz/post/2650550090/extracting-names-with-6-lines-of-python-code

>>> import nltk
>>> def extract_entities(text):
...     for sent in nltk.sent_tokenize(text):
...         for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
...             if hasattr(chunk, 'label'):  # only named-entity subtrees have a label
...                 print(chunk.label(), ' '.join(c[0] for c in chunk.leaves()))
...

Curtis Mattoon

What if a name has a middle name? It will not withstand – StatguyUser Nov 28 '16 at 13:13
I got this error in a newer version of NLTK : notimplementederror use label() to access a node label. Resolved it by changing last two lines to following : if hasattr(chunk, 'label'): print(chunk.label(), ' '.join(c[0] for c in chunk.leaves())) – Abhishek Poojary Jan 15 '21 at 15:54

Martin Thoma · Answer 3 · 2022-01-11T07:04:09.037

The answer of @trojane didn't quite work for me, but helped a lot for this one.

Prerequisites

Create a folder stanford-ner and download the following two files to it:

english.all.3class.distsim.crf.ser.gz
stanford-ner.jar (look for the download link, then extract the archive to get this file)

Script

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import nltk
from nltk.tag.stanford import StanfordNERTagger

text = u"""
Some economists have responded positively to Bitcoin, including
Francois R. Velde, senior economist of the Federal Reserve in Chicago
who described it as "an elegant solution to the problem of creating a
digital currency." In November 2013 Richard Branson announced that
Virgin Galactic would accept Bitcoin as payment, saying that he had invested
in Bitcoin and found it "fascinating how a whole new global currency
has been created", encouraging others to also invest in Bitcoin.
Other economists commenting on Bitcoin have been critical.
Economist Paul Krugman has suggested that the structure of the currency
incentivizes hoarding and that its value derives from the expectation that
others will accept it as payment. Economist Larry Summers has expressed
a "wait and see" attitude when it comes to Bitcoin. Nick Colas, a market
strategist for ConvergEx Group, has remarked on the effect of increasing
use of Bitcoin and its restricted supply, noting, "When incremental
adoption meets relatively fixed supply, it should be no surprise that
prices go up. And that’s exactly what is happening to BTC prices.
"""

st = StanfordNERTagger('stanford-ner/english.all.3class.distsim.crf.ser.gz',
                       'stanford-ner/stanford-ner.jar')

for sent in nltk.sent_tokenize(text):
    tokens = nltk.tokenize.word_tokenize(sent)
    tags = st.tag(tokens)
    for tag in tags:
        if tag[1] in ["PERSON", "LOCATION", "ORGANIZATION"]:
            print(tag)

Results

('Bitcoin', 'LOCATION')       # Wrong
('Francois', 'PERSON')
('R.', 'PERSON')
('Velde', 'PERSON')
('Federal', 'ORGANIZATION')
('Reserve', 'ORGANIZATION')
('Chicago', 'LOCATION')
('Richard', 'PERSON')
('Branson', 'PERSON')
('Virgin', 'PERSON')         # Wrong
('Galactic', 'PERSON')       # Wrong
('Bitcoin', 'PERSON')        # Wrong
('Bitcoin', 'LOCATION')      # Wrong
('Bitcoin', 'LOCATION')      # Wrong
('Paul', 'PERSON')
('Krugman', 'PERSON')
('Larry', 'PERSON')
('Summers', 'PERSON')
('Bitcoin', 'PERSON')        # Wrong
('Nick', 'PERSON')
('Colas', 'PERSON')
('ConvergEx', 'ORGANIZATION')
('Group', 'ORGANIZATION')     
('Bitcoin', 'LOCATION')       # Wrong
('BTC', 'ORGANIZATION')       # Wrong
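
In the results above, "Bitcoin" is tagged as both PERSON and LOCATION. One possible heuristic (an assumption on my part, not something Stanford NER provides) is to drop any token that receives more than one distinct label across the document. A minimal sketch, with `drop_inconsistent_tokens` as a hypothetical helper name:

```python
from collections import defaultdict


def drop_inconsistent_tokens(tags):
    """Remove tokens that were given conflicting labels in the same text."""
    labels_by_token = defaultdict(set)
    for token, label in tags:
        labels_by_token[token].add(label)
    # keep only tokens that were always tagged with the same label
    return [(t, l) for t, l in tags if len(labels_by_token[t]) == 1]


tags = [
    ("Bitcoin", "LOCATION"), ("Richard", "PERSON"), ("Branson", "PERSON"),
    ("Virgin", "PERSON"), ("Galactic", "PERSON"), ("Bitcoin", "PERSON"),
]
print(drop_inconsistent_tokens(tags))
# → [('Richard', 'PERSON'), ('Branson', 'PERSON'), ('Virgin', 'PERSON'), ('Galactic', 'PERSON')]
```

This would not catch "Virgin Galactic", which is tagged PERSON consistently, and it could also drop a real name that happens to double as a place.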

What does `u"textString"` mean before the String, I know about `r"text\String"` -> which is raw. — abhinit21, Jan 10 '22 at 12:54
This is a left-over of Python 2. I'll remove it. See https://stackoverflow.com/a/2464968/562769 — Martin Thoma, Jan 11 '22 at 07:03

Shivansh bhandari · Answer 4 · 2020-01-02T22:45:42.583

I only wanted to extract person names, so I decided to check every name in the output against WordNet (a large lexical database of English). More information on WordNet can be found here: http://www.nltk.org/howto/wordnet.html

import nltk
from nameparser.parser import HumanName
from nltk.corpus import wordnet


person_list = []
person_names=person_list
def get_human_names(text):
    tokens = nltk.tokenize.word_tokenize(text)
    pos = nltk.pos_tag(tokens)
    sentt = nltk.ne_chunk(pos, binary = False)

    person = []
    name = ""
    for subtree in sentt.subtrees(filter=lambda t: t.label() == 'PERSON'):
        for leaf in subtree.leaves():
            person.append(leaf[0])
        if len(person) > 1: #avoid grabbing lone surnames
            for part in person:
                name += part + ' '
            if name[:-1] not in person_list:
                person_list.append(name[:-1])
            name = ''
        person = []
#     print (person_list)

text = """

Some economists have responded positively to Bitcoin, including 
Francois R. Velde, senior economist of the Federal Reserve in Chicago 
who described it as "an elegant solution to the problem of creating a 
digital currency." In November 2013 Richard Branson announced that 
Virgin Galactic would accept Bitcoin as payment, saying that he had invested 
in Bitcoin and found it "fascinating how a whole new global currency 
has been created", encouraging others to also invest in Bitcoin.
Other economists commenting on Bitcoin have been critical. 
Economist Paul Krugman has suggested that the structure of the currency 
incentivizes hoarding and that its value derives from the expectation that 
others will accept it as payment. Economist Larry Summers has expressed 
a "wait and see" attitude when it comes to Bitcoin. Nick Colas, a market 
strategist for ConvergEx Group, has remarked on the effect of increasing 
use of Bitcoin and its restricted supply, noting, "When incremental 
adoption meets relatively fixed supply, it should be no surprise that 
prices go up. And that’s exactly what is happening to BTC prices."
"""

names = get_human_names(text)
for person in person_list:
    person_split = person.split(" ")
    for name in person_split:
        if wordnet.synsets(name):
            if(name in person):
                person_names.remove(person)
                break

print(person_names)

OUTPUT

['Francois R. Velde', 'Richard Branson', 'Economist Paul Krugman', 'Nick Colas']

Apart from Larry Summers, all the names are correct; Larry Summers is dropped because of the last name "Summers".
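
One thing to watch with this kind of filter: removing items from a list while looping over that same list can cause elements to be skipped. A variant that builds a new list avoids that. The sketch below is only an illustration; `filter_names` and `is_dictionary_word` are hypothetical names, and the toy dictionary stands in for a real WordNet lookup so the example runs without NLTK data.

```python
def filter_names(candidates, is_dictionary_word):
    """Keep candidates in which no part is an ordinary dictionary word."""
    return [
        name for name in candidates
        if not any(is_dictionary_word(part) for part in name.split())
    ]


# Stand-in for a WordNet lookup (assumption: lowercase matching is enough here).
toy_dictionary = {"virgin", "economist", "summers"}


def in_toy_dictionary(word):
    return word.lower() in toy_dictionary


candidates = ["Francois R. Velde", "Virgin Galactic", "Economist Larry Summers"]
print(filter_names(candidates, in_toy_dictionary))  # → ['Francois R. Velde']
```

With NLTK installed, the predicate could be something like `lambda w: bool(wordnet.synsets(w))`. Since every candidate is checked, the result may differ from a loop that mutates the list as it goes.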

Hey @EdgarH, should work fine now. `person_names` needed to be initialized after `person_list` — Shivansh bhandari, Jan 02 '20 at 22:50
When dealing with NER with NLTK, such imports are an issue, and 'pip install XYZ' did not work: from nameparser.parser import HumanName AND from nltk.tag.stanford import NERTagger — tursunWali, Mar 24 '21 at 04:04

score 5 · Answer 5 · answered Dec 08 '13 at 23:57

You can try to resolve the names you find and check whether they appear in a database such as freebase.com. Get the data locally and query it (it's in RDF), or use Google's API: https://developers.google.com/freebase/v1/getting-started. Most big companies, geographical locations, etc. (which your snippet would otherwise catch) could then be discarded based on the Freebase data.
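
The discard step can be sketched independently of any particular database. The example below assumes the known companies and places have already been loaded into a local set; `discard_known_entities` is a hypothetical helper, and the sample entries are made up for illustration, not taken from Freebase.

```python
def discard_known_entities(candidates, known_non_persons):
    """Drop candidates that match a known organization or place name."""
    # casefold() gives a case-insensitive comparison
    known = {entry.casefold() for entry in known_non_persons}
    return [name for name in candidates if name.casefold() not in known]


# Hypothetical sample data standing in for a real entity database.
known_non_persons = {"Virgin Galactic", "Federal Reserve", "ConvergEx Group"}
candidates = ["Richard Branson", "Virgin Galactic", "Paul Krugman"]
print(discard_known_entities(candidates, known_non_persons))
# → ['Richard Branson', 'Paul Krugman']
```

Exact matching is the simplest choice; a real lookup would probably need to handle aliases and spelling variants.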

Viktor Vojnovski

This api has been retired – Anindita Bhowmik Oct 14 '16 at 12:12

Maxmoe · Answer 6 · 2019-07-05T15:33:28.133

I would like to post a brute-force, greedy solution to the problem raised by @Enthusiast: get the full name of a person if possible.

spaCy uses the capitalization of the first character of each name as a criterion for recognizing a PERSON. For example, 'jim hoffman' won't be recognized as a named entity, while 'Jim Hoffman' will be.

Therefore, if the task is simply to pick out persons from a script, we can first capitalize the first letter of each word and then pass the result to spaCy.

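import spacy

def capitalizeWords(text):
  """Capitalize the first letter of every word, one sentence per line."""
  newText = ''

  for sentence in text.split('.'):
    newSentence = ''
    for word in sentence.split():
      # upper-case only the first character; leave the rest untouched
      newSentence += word[0].upper() + word[1:] + ' '
    newText += newSentence + '\n'

  return newText

nlp = spacy.load('en_core_web_md')

rawText = "YOUR TEXT GOES HERE"  # placeholder input
doc = nlp(capitalizeWords(rawText))

#......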

Note that this approach covers full names at the cost of more false positives.

OSError: [E050] Can't find model 'en_core_web_md'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory. — Kartik Punjabi, Oct 09 '21 at 13:48
You have to install this by python -m spacy download en_core_web_md — PankajKushwaha, Jan 11 '22 at 11:45

score 1 · Answer 7 · answered Jul 27 '16 at 13:11

This worked pretty well for me. I just had to change one line in order for it to run.

    for subtree in sentt.subtrees(filter=lambda t: t.node == 'PERSON'):

needs to be

    for subtree in sentt.subtrees(filter=lambda t: t.label() == 'PERSON'):

There were imperfections in the output (for example it identified "Money Laundering" as a person), but with my data a name database may not be dependable.
