15

I took the question from here and made my own changes. I have the following code:

import nltk
from nltk.corpus import stopwords

def content_text(text):
    stopwords = nltk.corpus.stopwords.words('english')
    content = [w for w in text if w.lower() in stopwords]
    return content

How can I print the 10 most frequently occurring words of a text, 1) including and 2) excluding stopwords?

user2064809
  • possible duplicate of [How can I count the occurrences of a list item in Python?](http://stackoverflow.com/questions/2600191/how-can-i-count-the-occurrences-of-a-list-item-in-python) – ivan_pozdeev Feb 08 '15 at 11:18

3 Answers

23

There is a FreqDist class in nltk:

import nltk

# frequency distribution over all (lowercased) tokens
allWords = nltk.tokenize.word_tokenize(text)
allWordDist = nltk.FreqDist(w.lower() for w in allWords)

# the same distribution with stopwords filtered out (compare lowercased words)
stopwords = nltk.corpus.stopwords.words('english')
allWordExceptStopDist = nltk.FreqDist(
    w.lower() for w in allWords if w.lower() not in stopwords)

To extract the 10 most common:

mostCommon = allWordDist.most_common(10)  # a list of (word, count) pairs, not a dict
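
To print them with their counts, a small sketch built on the variables above:

for word, count in allWordDist.most_common(10):
    print(word, count)

for word, count in allWordExceptStopDist.most_common(10):
    print(word, count)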
igorushi
  • I get this error: AttributeError: 'FreqDist' object has no attribute 'most_common' – user2064809 Feb 08 '15 at 14:46
  • Can you please provide full listing? – igorushi Feb 08 '15 at 20:17
  • 2
    You should check against stopwords with lowercased strings. From: `allWordExceptStopDist = nltk.FreqDist(w.lower() for w in allWords if w not in stopwords) ` To: `allWordExceptStopDist = nltk.FreqDist(w.lower() for w in allWords if w.lower() not in stopwords)` – abevieiramota May 10 '17 at 13:26
5

I'm not sure about the `in stopwords` check in your function, but either way you can use a `collections.Counter` with `most_common(10)` to get the 10 most frequent words:

from collections import Counter
from string import punctuation
import nltk


def content_text(text):
    stopwords = set(nltk.corpus.stopwords.words('english'))  # a set gives O(1) lookups
    with_stp = Counter()
    without_stp = Counter()
    with open(text) as f:
        for line in f:
            spl = line.split()
            # update the count of all words in the line that are in stopwords
            with_stp.update(w.lower().rstrip(punctuation) for w in spl if w.lower() in stopwords)
            # update the count of all words in the line that are not in stopwords
            without_stp.update(w.lower().rstrip(punctuation) for w in spl if w.lower() not in stopwords)
    # return the ten most common (word, count) pairs from each counter
    return with_stp.most_common(10), without_stp.most_common(10)

wth_stop, wthout_stop = content_text(...)
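
For example, with a path to a plain-text file on disk (the filename here is only a placeholder):

wth_stop, wthout_stop = content_text('speech.txt')  # hypothetical path to any local text file
print(wth_stop)      # ten most common stopwords with counts
print(wthout_stop)   # ten most common non-stopwords with counts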

If you are passing in an nltk corpus file object, just iterate over it:

def content_text(text):
    stopwords = set(nltk.corpus.stopwords.words('english'))
    with_stp = Counter()
    without_stp = Counter()
    for word in text:
        word = word.lower()
        if word in stopwords:
            # update the count of words that are stopwords
            with_stp.update([word])
        else:
            # update the count of words that are not stopwords
            without_stp.update([word])
    # return the ten most common words (without counts) from each counter
    return [w for w, _ in with_stp.most_common(10)], [w for w, _ in without_stp.most_common(10)]

print(content_text(nltk.corpus.inaugural.words('2009-Obama.txt')))

The nltk corpus view includes punctuation tokens, so that may not be what you want.
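
If that is a problem, one option (a small sketch, not part of the original answer) is to keep only alphabetic tokens before counting:

# keep only alphabetic tokens from the corpus view, then count as before
words = (w for w in nltk.corpus.inaugural.words('2009-Obama.txt') if w.isalpha())
print(content_text(words))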

Padraic Cunningham
  • When I write `wth_stop, wthout_stop = content_text(nltk.corpus.inaugural.words('2009-Obama.txt'))` I get an error. – user2064809 Feb 08 '15 at 11:37
  • 1
    @user2064809, I tested it and it works fine for me, what error are you getting? – Padraic Cunningham Feb 08 '15 at 11:39
  • TypeError: coercing to Unicode: need string or buffer, StreamBackedCorpusView found – user2064809 Feb 08 '15 at 11:41
  • what should I put exactly inside `content_text()` function? – user2064809 Feb 08 '15 at 11:56
  • just put `'2009-Obama.txt'` – Padraic Cunningham Feb 08 '15 at 11:58
  • @user2064809: You might have to make some changes to the code if you are using python 2. Also, if need help understanding an error message, you need to provide all of it. It's much easier to understand when we know exactly where in the script the exception was raised. – Håken Lid Feb 08 '15 at 12:28
  • I was just guessing based on the error message. A lot of functions that used to return ascii strings or simple structures such as lists in python 2, will return unicode and more complicated, but efficient iterators such as Views in python 3. Could it perhaps be caused by a different version of nltk? – Håken Lid Feb 08 '15 at 12:36
  • @HåkenLid, the error was because the OP was passing an nltk object to the first function instead of just a file name – Padraic Cunningham Feb 08 '15 at 12:37
  • That's all good then. Are you sure the last line in your snippet will run properly, though? There's two closing parentheses missing, and you'll probably have to import `print_function` with python 2. – Håken Lid Feb 08 '15 at 12:41
  • @HåkenLid, that was just a typo from copy/pasting. There is no need to import `print_function`. – Padraic Cunningham Feb 08 '15 at 12:42
  • It works! Thank you. I had to put the full file path in the first code: `wth_stop, wthout_stop = content_text('C:\\Documents and Settings\\Application Data\\nltk_data\\corpora\\inaugural\\2009-Obama.txt')` instead of `nltk.corpus.inaugural.words('2009-Obama.txt')`. But in the second code, `print(content_text(nltk.corpus.inaugural.words('2009-Obama.txt')))` works!! – user2064809 Feb 08 '15 at 12:49
  • @user2064809, I was not sure what exactly you were passing as text so I just added a way to use a normal file and an nltk file object. The first example will work for any file. just pass the path to the file. – Padraic Cunningham Feb 08 '15 at 13:12
1

You can try this:

for word, frequency in allWordDist.most_common(10):
    print('%s;%d' % (word, frequency))
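
Here `allWordDist` is assumed to be the `nltk.FreqDist` built in the accepted answer; a self-contained sketch, assuming the punkt tokenizer data is available and using a placeholder string:

import nltk

text = "the quick brown fox jumps over the lazy dog and the dog sleeps"  # placeholder text
allWordDist = nltk.FreqDist(w.lower() for w in nltk.tokenize.word_tokenize(text))

for word, frequency in allWordDist.most_common(10):
    print('%s;%d' % (word, frequency))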
João Almeida
prahlad