I'm trying to create a list of the 50 most common words in a specific text file, but I want to eliminate the stop words from that list. I tried to do that with this code:
import nltk
from nltk import FreqDist
from nltk.corpus import stopwords

carroll = nltk.Text(nltk.corpus.gutenberg.words('carroll-alice.txt'))
carroll_list = FreqDist(carroll)
stops = set(stopwords.words("english"))
filtered_words = [word for word in carroll_list if word not in stops]
However, this deletes the duplicate occurrences of the words I want to count. For example, when I do this:
fdist = FreqDist(filtered_words)
fdist.most_common(50)
I get the output:
[('right', 1), ('certain', 1), ('delighted', 1), ('adding', 1),
('work', 1), ('young', 1), ('Up', 1), ('soon', 1), ('use', 1),
('submitted', 1), ('remedies', 1), ('tis', 1), ('uncomfortable', 1)....]
It says there is only one instance of each word; clearly the duplicates got eliminated somewhere along the way. I want to keep the duplicates so I can see which word is most common. Any help would be greatly appreciated.
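For what it's worth, I suspect the problem is that iterating over a FreqDist yields each distinct word exactly once (its keys), so filtered_words already contains no duplicates before the second FreqDist is built. Here is a minimal sketch of what I think I actually want, filtering the raw token list instead; the download calls and the lowercasing are my own additions, not part of the code above:

import nltk
from nltk import FreqDist
from nltk.corpus import stopwords

nltk.download('gutenberg')   # one-time corpus downloads
nltk.download('stopwords')

# Raw token list: every occurrence of every word, duplicates included
carroll = nltk.corpus.gutenberg.words('carroll-alice.txt')
stops = set(stopwords.words('english'))

# Filter the tokens themselves, not the FreqDist keys, so repeated
# occurrences survive; lower() is my assumption, to catch capitalized
# stop words like 'Up' from the output above
filtered_tokens = [w for w in carroll if w.lower() not in stops]

fdist = FreqDist(filtered_tokens)
print(fdist.most_common(50))

Is filtering before building the FreqDist the right way to go about this, or is there a cleaner approach?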