Python and NLTK noob here. Messing around with something.
I have a string containing text from a PDF document, and I'm trying to extract entity names from it using the NLTK library:
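import nltk

with open(filename, 'r') as f:
    # read() returns one string; str(f.readlines()) would give the repr of a list
    str_output = f.read()

str_output = clean_str(str_output)  # clean_str is my own cleanup function
sentences = nltk.sent_tokenize(str_output)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
# binary=True labels every entity chunk 'NE' instead of PERSON, ORGANIZATION, etc.
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)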
So far I've read in the data, cleaned the string, and run it through sentence splitting, tokenization, POS tagging, and NE chunking. How do I get the distinct entity names out of the chunked sentences?
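
For reference, here is a rough sketch of the kind of thing I'm after, not a definitive implementation. It assumes each chunked sentence is a tree whose entity chunks are direct subtrees labeled 'NE' (which is what binary=True should produce) and whose leaves are (word, tag) tuples. The helper names extract_entity_names and collect_entity_names are made up for illustration.

```python
def extract_entity_names(tree):
    """Return the entity names found in one chunked sentence.

    Assumption: `tree` behaves like the trees produced by
    nltk.ne_chunk_sents(..., binary=True), i.e. entity chunks are
    subtrees labeled 'NE' and leaves are (word, pos_tag) tuples.
    """
    entity_names = []
    for node in tree:
        # Ordinary tokens are plain (word, tag) tuples with no label();
        # only chunk subtrees expose label() and leaves().
        if hasattr(node, 'label') and node.label() == 'NE':
            entity_names.append(' '.join(word for word, tag in node.leaves()))
    return entity_names


def collect_entity_names(chunked_sentences):
    """Gather the distinct entity names across all chunked sentences."""
    names = set()
    for tree in chunked_sentences:
        names.update(extract_entity_names(tree))
    return sorted(names)
```

With the pipeline above, this would be called as collect_entity_names(chunked_sentences). If binary=False were used instead, the chunks would carry labels such as PERSON or ORGANIZATION, so the 'NE' check would need to change accordingly.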