
Is there a more efficient way of doing this? My code reads a text file and extracts all nouns.

import nltk

with open(fileName) as file:  # open the file (fileName defined elsewhere)
    lines = file.read()       # read the whole file
sentences = nltk.sent_tokenize(lines)  # tokenize into sentences
nouns = []  # empty list to hold all nouns

for sentence in sentences:
    for word, pos in nltk.pos_tag(nltk.word_tokenize(sentence)):
        if pos == 'NN' or pos == 'NNP' or pos == 'NNS' or pos == 'NNPS':
            nouns.append(word)

How do I reduce the time complexity of this code? Is there a way to avoid using the nested for loops?

Thanks in advance!

Rakesh Adhikesavan
  • Replace the if condition with `if pos.startswith('NN'):` , also use a `set` or `collections.Counter`, don't keep a list. And do some map/reduce instead of a list comprehension. Otherwise, try `shallow parsing`, aka `chunking` – alvas Nov 07 '15 at 21:10
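The comment's first two suggestions can be sketched without running a tagger at all; the `(word, tag)` pairs below are hypothetical stand-ins for what `nltk.pos_tag` would return on a sentence like the OP's:

```python
from collections import Counter

# Hypothetical (word, tag) pairs standing in for nltk.pos_tag output:
tagged = [('My', 'PRP$'), ('code', 'NN'), ('reads', 'VBZ'), ('a', 'DT'),
          ('text', 'NN'), ('file', 'NN'), ('and', 'CC'),
          ('extracts', 'VBZ'), ('nouns', 'NNS'), ('.', '.')]

# One startswith() test replaces the four-way 'or' chain, and a Counter
# records frequencies instead of appending to a flat list:
noun_counts = Counter(word for word, pos in tagged if pos.startswith('NN'))
print(noun_counts)  # Counter({'code': 1, 'text': 1, 'file': 1, 'nouns': 1})
```

`pos.startswith('NN')` matches exactly the four Penn Treebank noun tags (NN, NNS, NNP, NNPS), so it is equivalent to the original condition.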

6 Answers


If you are open to options other than NLTK, check out TextBlob. It extracts all nouns and noun phrases easily:

>>> from textblob import TextBlob
>>> txt = """Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the inter
actions between computers and human (natural) languages."""
>>> blob = TextBlob(txt)
>>> print(blob.noun_phrases)
[u'natural language processing', 'nlp', u'computer science', u'artificial intelligence', u'computational linguistics']
Aziz Alto
  • You say "It extracts all nouns and noun phrases easily" but I don't see the option to extract nouns only. How could I get nouns alone in your example such as "computer" or "science"? – Sulli Feb 03 '17 at 21:58
  • You could use `blob.tags` to filter out `NN` only, something like `[n for n,t in blob.tags if t == 'NN']`. – Aziz Alto Feb 04 '17 at 05:54
  • Personally, I have found that `TextBlob` doesn't perform nearly as well as `nltk`. – austin_ce Nov 07 '17 at 23:53
  • The code may be simpler, but `textblob` calls NLTK to tokenize and tag. This *cannot* reduce the "time complexity" of the OP's code. – alexis Mar 02 '18 at 10:17
import nltk

lines = 'lines is some string of words'
# function to test if something is a noun
is_noun = lambda pos: pos[:2] == 'NN'
# do the nlp stuff
tokenized = nltk.word_tokenize(lines)
nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if is_noun(pos)] 

print(nouns)
>>> ['lines', 'string', 'words']

Useful tip: it is often the case that list comprehensions are a faster method of building a list than adding elements to a list with the .insert() or append() method, within a 'for' loop.
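A quick, self-contained check of that tip (absolute timings vary by machine; this only illustrates the pattern, not the tagger):

```python
from timeit import timeit

data = list(range(10_000))

def with_append():
    result = []
    for x in data:
        result.append(x * 2)
    return result

def with_comprehension():
    return [x * 2 for x in data]

# Both build the same list; the comprehension skips the repeated
# attribute lookup and method call that .append() costs per element.
assert with_append() == with_comprehension()
print(timeit(with_append, number=200))
print(timeit(with_comprehension, number=200))
```

Note that in the OP's case the tagging dominates the runtime, so this is a readability win more than a performance one.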

Boa
  • The answer is a correct train of thought. Using this is cleaner: `is_noun = lambda pos: True if pos[:2] == 'NN'`. Note: a list comprehension is not necessarily faster than a for loop. It's just that you don't have to materialize a list and can deal with nested loops as generators instead of lists. – alvas Nov 08 '15 at 10:01
  • @alvas - I didn't use something like `... pos[:2] == 'NN'...`, because it might match undesirable strings. For all I know, there might be a `pos` that has a value of 'NNA', and we do not want to match that. Strictly speaking, the `True if` and `else False` parts aren't necessary either, but I included them for clarity. Good point about list comprehensions not being necessarily faster than a loop (I guess I was being glib there) - I've edited the post accordingly. – Boa Nov 08 '15 at 17:03
  • Just out of curiosity, could you give an instance of 'NNA'? It's so that we can do some checks in NLTK on other things not related to this question though =) . Technically, there shouldn't be any tags outside of this tagset: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html – alvas Nov 08 '15 at 22:49
  • @alvas - The scenario I presented was hypothetical, and the point that I was making was that I didn't know, a priori, what values the 'pos' variable might take (maybe I should have said something like 'NNABCDEFG' instead of 'NNA' to make that notion clearer), so to be safe, I went with the conditional parameters that were presented in the original question. That conditional line, and any other part of the answer I provided can be modified as necessary; I suspect that the performance difference between the 'pos[:2]' variant, and the long conditional that I presented, is pretty marginal. – Boa Nov 09 '15 at 02:15
  • @alvas - alright - I've edited the post to include your suggestion, to make the answer cleaner. Cheers ;) – Boa Nov 09 '15 at 02:41

You can achieve good results using nltk, Textblob, SpaCy or any of the many other libraries out there. These libraries will all do the job but with different degrees of efficiency.

import nltk
from textblob import TextBlob
import spacy
nlp = spacy.load('en')
nlp1 = spacy.load('en_core_web_lg')

txt = """Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages."""

On my Windows 10 HP laptop (i5, 2 cores / 4 logical processors, 8 GB RAM), in a Jupyter notebook, I ran some comparisons and here are the results.

For TextBlob:

%%time
print([w for (w, pos) in TextBlob(txt).pos_tags if pos[0] == 'N'])

And the output is

>>> ['language', 'processing', 'NLP', 'field', 'computer', 'science', 'intelligence', 'linguistics', 'inter', 'actions', 'computers', 'languages']
    Wall time: 8.01 ms #average over 20 iterations

For nltk:

%%time
print([word for (word, pos) in nltk.pos_tag(nltk.word_tokenize(txt)) if pos[0] == 'N'])

And the output is

>>> ['language', 'processing', 'NLP', 'field', 'computer', 'science', 'intelligence', 'linguistics', 'inter', 'actions', 'computers', 'languages']
    Wall time: 7.09 ms #average over 20 iterations

For spacy:

%%time
print([token.text for token in nlp(txt) if token.pos_ == 'NOUN'])

And the output is

>>> ['language', 'processing', 'field', 'computer', 'science', 'intelligence', 'linguistics', 'inter', 'actions', 'computers', 'languages']
    Wall time: 30.19 ms #average over 20 iterations

It seems nltk and TextBlob are reasonably fast, and this is to be expected since they store nothing else about the input text, txt. spaCy is way slower. One more thing: spaCy missed the noun NLP, while nltk and TextBlob got it. I would opt for nltk or TextBlob unless there is something else I wish to extract from the input txt.


Check out a quick start into spacy here.
Check out some basics about TextBlob here.
Check out nltk HowTos here.

Samuel Nde
  • SpaCy missed NLP because it finds it to be a proper noun (PROPN). SpaCy is way slower because it has more capabilities, but you can disable the syntactic parser and speed things up quite a bit. – MrE Jul 09 '19 at 05:48
import nltk
lines = 'lines is some string of words'
tokenized = nltk.word_tokenize(lines)
nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if pos[:2] == 'NN']
print (nouns)

Just simplified a bit more.

Amit Ghosh

I'm not an NLP expert, but I think you're pretty close already: the nested loops visit each tagged word exactly once, so there isn't much room to improve on their time complexity.

Recent versions of NLTK have a built-in function that does what you're doing by hand, `nltk.tag.pos_tag_sents`, and it returns a list of lists of tagged words too.

Will Angley

Your code has no redundancy: You read the file once and visit each sentence, and each tagged word, exactly once. No matter how you write your code (e.g., using comprehensions), you will only be hiding the nested loops, not skipping any processing.

The only potential for improvement is in its space complexity: Instead of reading the whole file at once, you could read it in increments. But since you need to process a whole sentence at a time, it's not as simple as reading and processing one line at a time; so I wouldn't bother unless your files are whole gigabytes long; for short files it's not going to make any difference.
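Incremental reading could look something like this sketch; the regex split is a naive stand-in for a real sentence tokenizer such as `nltk.sent_tokenize`, which would still need to see each sentence whole:

```python
import re
from io import StringIO

def iter_sentences(fileobj, chunk_size=8192):
    """Yield sentences from a file-like object without loading it all
    into memory. A naive regex split stands in for a real tokenizer."""
    buffer = ''
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Split after sentence-ending punctuation; keep the possibly
        # incomplete tail in the buffer for the next chunk.
        parts = re.split(r'(?<=[.!?])\s+', buffer)
        buffer = parts.pop()
        yield from parts
    if buffer.strip():
        yield buffer

sample = StringIO("First sentence. Second one! Is this the third?")
print(list(iter_sentences(sample)))
# ['First sentence.', 'Second one!', 'Is this the third?']
```

Each sentence yielded here could then be fed to `nltk.pos_tag(nltk.word_tokenize(sentence))` exactly as in the question, with only one chunk of the file in memory at a time.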

In short, your loops are fine. There are a couple of things in your code you could clean up (e.g., the `if` clause that matches the POS tags), but they won't change anything efficiency-wise.

alexis