Python script to find word frequencies of a given document

Question

I am looking for a simple script that can find word frequencies in a given document (probably using the Porter stemmer).

Is there any library or simple script that does this process?

google for nltk stemming. Or search stackoverflow: http://stackoverflow.com/search?q=[python]+[nltk]+stemmer&submit=search. Post questions here if you get stuck. — Steven Rumbalski, Sep 20 '11 at 04:06
`collections.Counter(i.lower() for i in re.findall(r'\w+', document))` — JBernardo, Sep 20 '11 at 04:07
Dup http://stackoverflow.com/questions/4088265/word-frequency-count-using-python — David Nehme, Sep 20 '11 at 04:08
@JBernardo: Your solution would count "counting" and "counted" as two separate words. A library that uses a stemmer would count them together. — Steven Rumbalski, Sep 20 '11 at 04:12

score 2 · Accepted Answer · answered Sep 20 '11 at 04:11

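Use nltk:

import nltk

YOUR_STRING = "Your words"

# Split on whitespace to get the list of words
words = YOUR_STRING.split()
freq_dist = nltk.FreqDist(words)

# Words ordered from most to least frequent. In NLTK 3, keys() is no longer
# sorted by frequency, so use most_common() instead.
tokens = [word for word, count in freq_dist.most_common()]

# 50 most frequent
most_frequent = tokens[:50]

# 50 least frequent
least_frequent = tokens[-50:]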

MattoTodd

score 0 · Answer 2 · answered Sep 20 '11 at 04:14

Counting the words is the easy part: use a collections.Counter or a plain dict, depending on what you need. If you get stuck on that, searching SO itself will turn up the answer.
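For what it's worth, here is a minimal sketch of both counting options side by side. The sample sentence and the variable names are made up for illustration, and the `\w+` tokenization is borrowed from JBernardo's comment on the question rather than being the only reasonable choice.

```python
import re
from collections import Counter

# Hypothetical sample text, only for illustration.
document = "the cat and the hat and the bat"
words = re.findall(r"\w+", document.lower())

# Plain dict: keep a running count per word.
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1

# Counter: the same tallies, plus helpers such as most_common().
counter = Counter(words)

print(counts["the"])           # count for a single word
print(counter.most_common(2))  # the two most frequent words with their counts
```

The dict is enough if you only need lookups; Counter saves some work when you also want the words ranked by frequency.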

I think you also want the Porter Stemmer, which has a Python version at http://tartarus.org/~martin/PorterStemmer/python.txt
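A rough sketch of how the two pieces could fit together: the counting is done by collections.Counter, and the stemmer is passed in as a plain function so a real one can be plugged in. The names `stemmed_frequencies` and `strip_suffix` are made up for this example, and `strip_suffix` is only a toy stand-in so the snippet runs without extra dependencies. It is not the Porter algorithm.

```python
import re
from collections import Counter


def strip_suffix(word):
    """Toy stand-in for a real stemmer: drops a few common endings.

    This is NOT the Porter algorithm; swap in a real stemmer for actual use.
    """
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters of stem remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stemmed_frequencies(document, stem=strip_suffix):
    """Count the words in `document` after lowercasing and stemming each one."""
    words = re.findall(r"\w+", document.lower())
    return Counter(stem(w) for w in words)


freqs = stemmed_frequencies("Counting words: she counted, he counts, they count.")
print(freqs.most_common(3))
```

With a real stemmer, "counting", "counted" and "counts" are grouped under one stem in the same way. If nltk is installed, passing `stem=nltk.stem.PorterStemmer().stem` should work as a drop-in replacement for the toy function.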

Roshan Mathews

More recent versions of the same stemmer are in nltk. See http://code.google.com/p/nltk/source/browse/trunk/nltk/nltk/stem/porter.py. – Steven Rumbalski Sep 20 '11 at 04:47
