Extracting nouns from a corpus using NLTK

Question

Can anyone please tell me how to retrieve only the nouns from the tagged output of the code below? Please correct the code if possible. Thanks for the help :)

import nltk
from nltk.corpus import state_union
from textblob import TextBlob
from nltk.tokenize import TweetTokenizer
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import PunktSentenceTokenizer

sample_text=state_union.raw("2006-GWBush.txt")
train_text= state_union.raw("2005-GWBush.txt")
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized:
            words=nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            if(pos =='NN' or pos == 'NNP' or pos =='NNS' or pos=='NNPS'):
                print(tagged)
    except Exception as e:
        print(str(e))

process_content()

Note: the original source of the code is https://pythonprogramming.net/part-of-speech-tagging-nltk-tutorial/

See also https://stackoverflow.com/questions/49564176/python-nltk-more-efficient-way-to-extract-noun-phrases — alvas, Apr 01 '18 at 01:38

Xiaoxia Lin · Accepted Answer · 2018-03-31T19:05:34.283

For each sentence, tagged = nltk.pos_tag(words) gives you a list of (word, tag) pairs (let's call the tag "pos"). E.g., for the first sentence

u"PRESIDENT GEORGE W. BUSH'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS ON THE STATE OF THE UNION\n \nJanuary 31, 2006\n\nTHE PRESIDENT: Thank you all."

you would get:

[(u'PRESIDENT', 'NNP'), (u'GEORGE', 'NNP'), (u'W.', 'NNP'), (u'BUSH','NNP'), 
(u"'S", 'POS'), (u'ADDRESS', 'NNP'), (u'BEFORE', 'IN'), (u'A', 'NNP'), (u'JOINT', 'NNP'), 
(u'SESSION', 'NNP'), (u'OF', 'IN'), (u'THE', 'NNP'), (u'CONGRESS', 'NNP'), (u'ON', 'NNP'), 
(u'THE', 'NNP'), (u'STATE', 'NNP'), (u'OF', 'IN'), (u'THE', 'NNP'), (u'UNION', 'NNP'), 
(u'January', 'NNP'), (u'31', 'CD'), (u',', ','), (u'2006', 'CD'), (u'THE', 'NNP'), 
(u'PRESIDENT', 'NNP'), (u':', ':'), (u'Thank', 'NNP'), (u'you', 'PRP'), (u'all', 'DT'),
 (u'.', '.')]

If you want to retrieve all the words with pos =='NN' or pos == 'NNP' or pos =='NNS' or pos=='NNPS', you can do

nouns = [word for (word, pos) in tagged if pos in ['NN','NNP','NNS','NNPS']]

Then you would get a list of nouns for each sentence:

[u'PRESIDENT', u'GEORGE', u'W.', u'BUSH', u'ADDRESS', u'A', u'JOINT', u'SESSION', u'THE', u'CONGRESS', u'ON', u'THE', u'STATE', u'THE', u'UNION', u'January', u'THE', u'PRESIDENT', u'Thank']
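
Putting this together, a corrected version of the function from the question might look roughly like the sketch below. The helper name extract_nouns, the NOUN_TAGS constant and the sentences parameter are my own additions for illustration and are not in the original code. It also assumes the NLTK tokenizer and tagger data are already downloaded.

```python
# Penn Treebank noun tags, as listed in this answer.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def extract_nouns(tagged):
    """Return the words from a list of (word, tag) pairs whose tag is a noun tag."""
    return [word for word, pos in tagged if pos in NOUN_TAGS]


def process_content(sentences):
    """Tokenize and tag each sentence, then print only its nouns."""
    # Imported here so extract_nouns() can be tried without NLTK installed.
    import nltk

    for sentence in sentences:
        words = nltk.word_tokenize(sentence)
        tagged = nltk.pos_tag(words)
        # Filter the (word, tag) pairs instead of testing an undefined `pos`.
        print(extract_nouns(tagged))
```

With the variables from the question, you would call it as process_content(tokenized). The original code failed because pos was never defined inside the loop. The tag only exists as the second element of each pair in tagged, so the check has to happen per pair.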

... Or simply `... if pos in ['NN', 'NNP', 'NNS', 'NNPS']` — tripleee, Mar 31 '18 at 18:57
`[word for word, pos in tagged if pos.startswith('NN')]` — alvas, Apr 01 '18 at 01:38
