
I'm trying to create a list of the 50 most common words in a specific text file, but I want to eliminate the stop words from that list. I have done that using this code:

import nltk
from nltk import FreqDist
from nltk.corpus import gutenberg, stopwords

carroll = nltk.Text(gutenberg.words('carroll-alice.txt'))
carroll_list = FreqDist(carroll)
stops = set(stopwords.words("english"))
filtered_words = [word for word in carroll_list if word not in stops]

However, this removes the duplicate occurrences of the words I want to count. For example, when I do this:

fdist = FreqDist(filtered_words)
fdist.most_common(50)

I get the output:

[('right', 1), ('certain', 1), ('delighted', 1), ('adding', 1),
 ('work', 1), ('young', 1), ('Up', 1), ('soon', 1), ('use', 1),
 ('submitted', 1), ('remedies', 1), ('tis', 1), ('uncomfortable', 1), ...]

It says there is only one instance of each word; clearly the duplicates were eliminated. I want to keep the duplicates so I can see which word is most common. Any help would be greatly appreciated.

Cody
  • Please post a [Minimal, complete, verifiable example](http://stackoverflow.com/help/mcve). Without the original list and other supporting items, we can't reproduce your problem. It appears that you have the filtered words only once each, rather than the full frequency from the original text. – Prune Sep 21 '16 at 21:50

1 Answer


As you have it written now, carroll_list is already a frequency distribution, with the words as keys and their occurrence counts as values:

>>> carroll_list
FreqDist({u',': 1993, u"'": 1731, u'the': 1527, u'and': 802, u'.': 764, u'to': 725, u'a': 615, u'I': 543, u'it': 527, u'she': 509, ...})
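
Iterating over a FreqDist (it behaves like a dict) visits each distinct word exactly once, regardless of its count. A small illustrative snippet:

>>> fd = FreqDist(['the', 'the', 'the', 'cat'])
>>> [word for word in fd]
['the', 'cat']
>>> fd['the']
3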

Your list comprehension iterates over those keys, so each word appears only once in filtered_words. I believe you actually want to build filtered_words from the original token sequence instead:

filtered_words = [word for word in carroll if word not in stops]
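
Putting the pieces together, a minimal sketch of the corrected pipeline (assuming the NLTK gutenberg and stopwords corpora have already been downloaded):

import nltk
from nltk import FreqDist
from nltk.corpus import gutenberg, stopwords

# Filter the raw token stream, not the FreqDist keys,
# so that repeated words are still counted.
carroll = nltk.Text(gutenberg.words('carroll-alice.txt'))
stops = set(stopwords.words("english"))
filtered_words = [word for word in carroll if word not in stops]

fdist = FreqDist(filtered_words)
print(fdist.most_common(50))

Note that punctuation tokens and capitalized words will still appear in the counts; lowercasing each token and checking word.isalpha() before counting are common additional filters.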

Also, you should avoid variable names that shadow Python builtins (an earlier revision of your code named this variable list, which is a builtin).
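
For example, a contrived snippet showing the problem with shadowing:

>>> list = FreqDist(['a', 'a', 'b'])   # shadows the builtin list
>>> list('abc')                        # the builtin is no longer reachable
Traceback (most recent call last):
  ...
TypeError: 'FreqDist' object is not callable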

FamousJameous