NLTK: Extract the entity name from a string

Question

Python and NLTK noob here. Messing around with something.

I have a string containing text from a PDF document, and I'm trying to extract entity names from it using the NLTK library:

import nltk

with open(filename, 'r') as f:
    str_output = f.read()  # read the whole file as a single string

str_output = clean_str(str_output)

sentences = nltk.sent_tokenize(str_output)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

I've imported the data, cleaned the string, and run it through sentence splitting, tokenizing, POS tagging, and chunking. How do I get the individual entity names out of the chunked result?

You're iterating through the sentence multiple times. Don't do that. — alvas, Jul 18 '18 at 23:29

score 1 · Answer 1 · answered Jul 18 '18 at 15:57

This should work:

import nltk

with open('sample.txt', 'r') as f:
    sample = f.read()

sentences = nltk.sent_tokenize(sample)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

def extract_entity_names(t):
    entity_names = []

    # NLTK 3 replaced the old `node` attribute with the label() method
    if hasattr(t, 'label') and t.label():
        if t.label() == 'NE':
            entity_names.append(' '.join([child[0] for child in t]))
        else:
            for child in t:
                entity_names.extend(extract_entity_names(child))

    return entity_names

entity_names = []
for tree in chunked_sentences:
    # Print results per sentence
    # print(extract_entity_names(tree))

    entity_names.extend(extract_entity_names(tree))

# Print all entity names
# print(entity_names)

# Print unique entity names
print(set(entity_names))

Don't encourage the questioner by reusing the code that iterates through the data multiple times. — alvas, Jul 18 '18 at 23:30
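
One way to act on the comments about iterating multiple times is to tokenize, tag, and chunk each sentence inside a single loop. The following is only a sketch, not the commenter's own solution: the name `entities_in_text` and the `label` parameter are made up for illustration, and it assumes NLTK 3 with the tokenizer, tagger, and NE chunker models already downloaded.

```python
def extract_entity_names(tree, label="NE"):
    """Return the text of every subtree of `tree` carrying `label`.

    Expects an nltk.Tree as produced by nltk.ne_chunk, whose leaves
    are (word, pos_tag) tuples.
    """
    return [
        " ".join(word for word, _tag in subtree.leaves())
        for subtree in tree.subtrees(lambda s: s.label() == label)
    ]


def entities_in_text(text):
    """Collect unique entity names, handling each sentence in one pass."""
    # Imported here so the helper above can be used on pre-built trees
    # without loading NLTK's models.
    import nltk

    names = set()
    for sentence in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        # binary=True labels every entity "NE" instead of PERSON, GPE, etc.
        chunked = nltk.ne_chunk(tagged, binary=True)
        names.update(extract_entity_names(chunked))
    return names
```

Calling `entities_in_text(sample)` on the file contents should give the same kind of set as the answer above, without building three intermediate lists first.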
