Fdist and top 10 function words

Question

I have to write a script that gives me all content words in descending order of frequency. I need the 10 most frequent content words, so I not only need a list of the 10 most frequent words in my corpus, I also need to filter out any function words (and, or, any punctuation...). What I have so far is the following:

fileids = corpus.fileids()
text = corpus.words(fileids)
wlist = []
ftable = nltk.FreqDist(text)
wlist.append(ftable.keys())

This gives me a very neat list of all words in descending order of frequency, but how do I filter the function words out?

Thank you.

score 1 · Accepted Answer · edited May 23 '17 at 12:03

You want to filter out a set of words (stopwords). Taking the core idea from this SO answer:

Add a couple of lines to your code. Just after

fileids = corpus.fileids()
text = corpus.words(fileids)

build a list of stopwords and filter them out of your text:

# get NLTK's English stopword list (needs the 'stopwords' corpus, e.g. via nltk.download('stopwords'))
# stored as a set so membership tests are fast
stp = set(nltk.corpus.stopwords.words('english'))

# from your text of words, keep only the ones NOT in stp
# (the stopword list is lowercase, so compare lowercased tokens)
filtered_text = [w for w in text if w.lower() not in stp]

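Now continue as before, building the frequency table from the filtered words:

ftable = nltk.FreqDist(filtered_text)
# in NLTK 3, FreqDist is a Counter subclass and keys() is no longer sorted by frequency;
# most_common(10) returns the ten most frequent (word, count) pairs
wlist = ftable.most_common(10)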
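
The stopword filter alone leaves punctuation tokens in the counts, which the question also wanted removed. Here is a minimal sketch of the whole pipeline, assuming purely alphabetic tokens are the ones worth keeping (this also drops numbers and tokens such as "don't"). The helper name top_content_words and the sample data are hypothetical; collections.Counter stands in for nltk.FreqDist so the example runs without any corpus downloads.

```python
from collections import Counter


def top_content_words(words, stopwords, n=10):
    """Return the n most frequent words that are neither stopwords nor punctuation.

    Hypothetical helper for illustration, not part of NLTK.
    """
    # A set makes each membership test O(1) instead of scanning a list.
    stop = {s.lower() for s in stopwords}
    counts = Counter(
        w.lower()
        for w in words
        # Assumption: only purely alphabetic tokens count as words.
        if w.isalpha() and w.lower() not in stop
    )
    return counts.most_common(n)


# Tiny hand-made sample standing in for corpus.words(fileids).
sample = ["The", "cat", "and", "the", "dog", ",", "the", "cat", "."]
# Stand-in for nltk.corpus.stopwords.words('english').
sample_stopwords = ["the", "and", "or"]

top = top_content_words(sample, sample_stopwords, n=2)
print(top)
```

With a real corpus, the same function could be called with text as the first argument and the NLTK stopword list as the second.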

Hope that helps.

answered Jan 23 '13 at 21:04

Ram Narasimhan

Why, I didn't know NLTK had a built in list of stopwords, thanks a million – Shifu Jan 23 '13 at 23:53

Yes, NLTK is a fantastic resource, and I am always discovering new treasures in it. – Ram Narasimhan Jan 24 '13 at 20:22
