I have a sentence as follows:
txt = "i am living in the West Bengal and my brother live in New York. My name is John Smith"
What I need is:
- Get the chunks labeled GPE or LOCATION and join the words of each chunk with "_".
- Get the chunks labeled PERSON and remove them from the text.
The output I need:
preprocessed_txt = "i am living in the West_Bengal and my brother live in New_York. My name is "
I used the code from "NLTK Named Entity recognition to a Python list" to get the labels of the chunks:
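import nltk

# Split into sentences, then tokenize, POS-tag and NE-chunk each one
for sent in nltk.sent_tokenize(txt):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        # Named-entity chunks are subtrees with a label; plain tokens are (word, tag) tuples
        if hasattr(chunk, 'label'):
            print(chunk.label(), '_'.join(c[0] for c in chunk))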
This gives the following output:
LOCATION West_Bengal
GPE New_York
PERSON John_Smith
What should I do next to get preprocessed_txt from these labels?
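
One direction I can imagine, as a rough sketch only: collect each chunk as a (label, words) pair and then rewrite the original string with plain str.replace. The helper names apply_entities and extract_entities are made up for illustration, not NLTK functions. The sketch assumes that the chunk words joined by a single space appear verbatim in the text, and it replaces every occurrence of that phrase, which may not hold for all inputs.

```python
def apply_entities(text, entities,
                   join_labels=("GPE", "LOCATION"),
                   drop_labels=("PERSON",)):
    """Rewrite text given (label, [word, ...]) pairs (hypothetical helper).

    Chunks whose label is in join_labels get their words joined with "_";
    chunks whose label is in drop_labels are removed from the text.
    """
    for label, words in entities:
        phrase = " ".join(words)  # assumes words are separated by one space in text
        if label in join_labels:
            text = text.replace(phrase, "_".join(words))
        elif label in drop_labels:
            text = text.replace(phrase, "")
    return text


def extract_entities(text):
    """Collect (label, [word, ...]) pairs with the same NLTK calls as above."""
    import nltk  # imported here so apply_entities works without NLTK installed

    entities = []
    for sent in nltk.sent_tokenize(text):
        for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
            if hasattr(chunk, "label"):
                entities.append((chunk.label(), [c[0] for c in chunk]))
    return entities
```

It would be called as preprocessed_txt = apply_entities(txt, extract_entities(txt)). The result depends on which labels the NLTK chunker actually assigns, so I am not sure whether this is robust or whether there is a better way that works on the tokens directly.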