Dropping specific words out of an NLTK distribution beyond stopwords

Question

I have a simple sentence, shown below. I want to drop prepositions and words such as "A" and "IT" from the list of tokens before counting them. I looked through the Natural Language Toolkit (NLTK) documentation but couldn't find anything. Can someone show me how? Here is my code:

import nltk
from nltk.tokenize import RegexpTokenizer
test = "Hello, this is my sentence. It is a very basic sentence with not much information in it"
test = test.upper()
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(test)
fdist = nltk.FreqDist(tokens)
common = fdist.most_common(100)

possible duplicate of [Stopword removal with NLTK](http://stackoverflow.com/questions/19130512/stopword-removal-with-nltk) — alvas, Aug 05 '15 at 11:39

b3000 · Answer 1 · 2015-08-05T09:38:15.937

score 7

Might stopwords be the solution you're looking for?

You can filter them quite easily from the tokenized text:

from nltk.probability import FreqDist
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

en_stopws = stopwords.words('english')  # this loads the default stopwords list for English
en_stopws.append('spam')  # add any words you don't like to the list

test = "Hello, this is my sentence. It is a very basic sentence with not much information in it but a lot of spam"
test = test.lower()  # I changed it to lower(), since stopwords are all lower case
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(test)
tokens = [token for token in tokens if token not in en_stopws]  # filter stopwords
fdist = FreqDist(tokens)
common = fdist.most_common(100)

I didn't find a nice way to delete entries from the FreqDist. If you find something, let me know.
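
On deleting entries after the fact: the accepted answer points out that FreqDist is built on collections.Counter, so the ordinary dictionary `del` statement should work on it. Below is a minimal sketch under that assumption. It uses collections.Counter as a stand-in so it runs without NLTK or its corpora, and the `unwanted` set is a hypothetical word list, not something NLTK provides.

```python
from collections import Counter

tokens = "hello this is my sentence it is a very basic sentence".split()
fdist = Counter(tokens)  # stand-in for nltk.FreqDist(tokens)

unwanted = {"a", "it", "is"}  # hypothetical words to drop

for word in unwanted:
    if word in fdist:    # guard so absent words do not raise KeyError
        del fdist[word]  # removes the entry and its count

print(fdist.most_common(3))
```

Filtering the tokens before counting, as above, is usually simpler. Deleting afterwards may be handy when the distribution already exists and re-tokenizing is inconvenient.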

edited Aug 05 '15 at 09:38

answered Aug 05 '15 at 09:08


I'm getting a traceback: `File "C:\Python27\lib\site-packages\nltk\data.py", line 293, in __init__ raise IOError('No such file or directory: %r' % _path) IOError: No such file or directory: u'C:\\Users\\jason\\AppData\\Roaming\\nltk_data\\corpora\\stopwords\\IS'` when using the word `IS`, but I see your approach of filtering before the tokens go in – jason Aug 05 '15 at 09:28

@jason_cant_code I think you misunderstood how the stopword corpus is loaded. I edited the answer to make it a bit clearer. Also check the [book](http://www.nltk.org/book/ch02.html#wordlist-corpora) for more information – b3000 Aug 05 '15 at 09:39

score 3 · Accepted Answer · answered Aug 05 '15 at 11:35

Essentially, nltk.probability.FreqDist is a subclass of collections.Counter (https://github.com/nltk/nltk/blob/develop/nltk/probability.py#L61). Since it is a dictionary-like object, there are several ways to filter it:

1. Read into a FreqDist and filter it with a lambda function

>>> import nltk
>>> text = "Hello, this is my sentence. It is a very basic sentence with not much information in it"
>>> tokenized_text = nltk.word_tokenize(text)
>>> stopwords = nltk.corpus.stopwords.words('english')
>>> word_freq = nltk.FreqDist(tokenized_text)
>>> dict_filter = lambda word_freq, stopwords: dict( (word,word_freq[word]) for word in word_freq if word not in stopwords )
>>> filtered_word_freq = dict_filter(word_freq, stopwords)
>>> len(word_freq)
17
>>> len(filtered_word_freq)
8
>>> word_freq
FreqDist({'sentence': 2, 'is': 2, 'a': 1, 'information': 1, 'this': 1, 'with': 1, 'in': 1, ',': 1, '.': 1, 'very': 1, ...})
>>> filtered_word_freq
{'information': 1, 'sentence': 2, ',': 1, '.': 1, 'much': 1, 'basic': 1, 'It': 1, 'Hello': 1}

2. Read into a FreqDist and filter it with a dictionary comprehension

>>> word_freq
FreqDist({'sentence': 2, 'is': 2, 'a': 1, 'information': 1, 'this': 1, 'with': 1, 'in': 1, ',': 1, '.': 1, 'very': 1, ...})
>>> filtered_word_freq = {word: freq for word, freq in word_freq.items() if word not in stopwords}
>>> filtered_word_freq 
{'information': 1, 'sentence': 2, ',': 1, '.': 1, 'much': 1, 'basic': 1, 'It': 1, 'Hello': 1}

3. Filter the words before reading into a FreqDist

>>> import nltk
>>> text = "Hello, this is my sentence. It is a very basic sentence with not much information in it"
>>> tokenized_text = nltk.word_tokenize(text)
>>> stopwords = nltk.corpus.stopwords.words('english')
>>> filtered_tokenized_text = [word for word in tokenized_text if word not in stopwords]
>>> filtered_word_freq = nltk.FreqDist(filtered_tokenized_text)
>>> filtered_word_freq
FreqDist({'sentence': 2, 'information': 1, ',': 1, 'It': 1, '.': 1, 'much': 1, 'basic': 1, 'Hello': 1})
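
Note that 'It' survives in all three outputs above: the stopword list is lowercase and the membership test is case-sensitive. If the tokens are uppercased, as in the question, one option is to compare each token's lowercase form against the list. Here is a rough sketch of that idea using only the standard library. The `filtered_counts` helper and the short `stop_words` list are hypothetical and stand in for `stopwords.words('english')` plus any extra words you want to drop.

```python
import re
from collections import Counter

def filtered_counts(text, stop_words):
    """Count word tokens, skipping any whose lowercase form is in stop_words."""
    stop = {w.lower() for w in stop_words}  # set gives fast membership tests
    tokens = re.findall(r"\w+", text)       # same pattern as RegexpTokenizer(r'\w+')
    return Counter(t for t in tokens if t.lower() not in stop)

# Hypothetical stop list for illustration only
stop_words = ["this", "is", "my", "it", "a", "very", "with", "not", "in"]

text = "Hello, this is my sentence. It is a very basic sentence with not much information in it"
counts = filtered_counts(text.upper(), stop_words)
print(counts.most_common(2))
```

The tokens keep their original case in the result. Only the comparison is lowercased, so the same helper should behave the same way for upper, lower, or mixed-case input.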

In step 1, I didn't quite understand how the second filtering reduced the word count. I thought both filters remove stopwords from the dict of unique words, right? – Chaitanya Bapat, Apr 01 '19 at 18:38
