You don't need a Python wrapper for a Java lib; NLTK has Snowball built in! :)
>>> from nltk.stem import SnowballStemmer as SS
>>> stemmer = SS('english')
>>> stemmer.stem('dance')
u'danc'
>>> stemmer.stem('danced')
u'danc'
>>> stemmer.stem('dancing')
u'danc'
>>> stemmer.stem('dancer')
u'dancer'
>>> stemmer.stem('accordance')
u'accord'
Stemming isn't always going to give you exact roots, but it's a great start.
Below is an example of putting the stems to use. I'm building a dictionary of stem: (word, count) while keeping the shortest word seen for each stem, so ['dancing', 'danced', 'dances', 'dance', 'dancer'] becomes {'danc': ('dance', 4), 'dancer': ('dancer', 1)}.
Example Code: (text taken from http://en.wikipedia.org/wiki/Dance)
import re
from nltk.stem import SnowballStemmer as SS
text = """Dancing has evolved many styles. African dance is interpretative.
Ballet, ballroom (such as the waltz), and tango are classical styles of dance
while square dancing and the electric slide are forms of step dances.
More recently evolved are breakdancing and other forms of street dance,
often associated with hip hop culture.
Every dance, no matter what style, has something in common.
It not only involves flexibility and body movement, but also physics.
If the proper physics are not taken into consideration, injuries may occur."""
#extract words
words = [word.lower() for word in re.findall(r'\w+',text)]
stemmer = SS('english')
counts = dict()
#count stems, keeping the shortest word seen for each stem
for word in words:
    stem = stemmer.stem(word)
    if stem in counts:
        shortest, count = counts[stem]
        if len(word) < len(shortest):
            shortest = word
        counts[stem] = (shortest, count + 1)
    else:
        counts[stem] = (word, 1)
#convert {key: (word, count)} to [(word, count, key)] for convenient sort and print
output = [wordcount + (root,) for root,wordcount in counts.items()]
#trick to sort output by count (descending) & word (alphabetically)
output.sort(key=lambda x: (-x[1],x[0]))
for item in output:
print '%s:%d (Root: %s)' % item
Output:
dance:7 (Root: danc)
and:4 (Root: and)
are:4 (Root: are)
of:3 (Root: of)
style:3 (Root: style)
the:3 (Root: the)
evolved:2 (Root: evolv)
forms:2 (Root: form)
has:2 (Root: has)
not:2 (Root: not)
physics:2 (Root: physic)
african:1 (Root: african)
also:1 (Root: also)
as:1 (Root: as)
associated:1 (Root: associ)
ballet:1 (Root: ballet)
ballroom:1 (Root: ballroom)
body:1 (Root: bodi)
breakdancing:1 (Root: breakdanc)
---truncated---
I wouldn't recommend lemmatization for your specific needs:
>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> lmtzr = WordNetLemmatizer()
>>> lmtzr.lemmatize('dance')
'dance'
>>> lmtzr.lemmatize('dancer')
'dancer'
>>> lmtzr.lemmatize('dancing')
'dancing'
>>> lmtzr.lemmatize('dances')
'dance'
>>> lmtzr.lemmatize('danced')
'danced'
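(Part of why the lemmatizer looks so unhelpful here: WordNetLemmatizer treats every word as a noun unless you pass a part-of-speech tag, so verb forms like 'danced' pass through untouched. Passing 'v' helps for verbs, but 'dancer' still won't collapse to 'dance', so it doesn't change my recommendation.)
>>> lmtzr.lemmatize('danced', 'v')
'dance'
>>> lmtzr.lemmatize('dancing', 'v')
'dance'
>>> lmtzr.lemmatize('dancer', 'v')
'dancer'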
Substring matching is not a good idea, because it will always break down at some point, and often badly:
- fixed-length: the pseudo-words 'dancitization' and 'dancendence' will match 'dance' at 4 and 5 characters respectively.
- ratio: a low threshold will accept fakes like the ones above
- ratio: a high threshold will miss legitimate matches (e.g. 'running' vs. 'run'); see the rough sketch after this list.
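To see the ratio problem concretely, here's a minimal sketch using difflib.SequenceMatcher from the standard library (the exact numbers are just illustrative, not something I'd tune for production):
from difflib import SequenceMatcher

def ratio(a, b):
    # similarity in [0, 1] as computed by difflib's Ratcliff/Obershelp matcher
    return SequenceMatcher(None, a, b).ratio()

# real inflections vs. pseudo-words, all compared against 'dance'
for candidate in ['dancing', 'danced', 'dancitization', 'dancendence']:
    print 'dance ~ %s: %.2f' % (candidate, ratio('dance', candidate))

# a cutoff strict enough to reject 'dancendence' above also rejects this pair
print 'run ~ running: %.2f' % ratio('run', 'running')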
But with stemming you get:
>>> stemmer.stem('dancitization')
u'dancit'
>>> stemmer.stem('dancendence')
u'dancend'
>>> #since dancitization gives us dancit, let's try dancization to get danc
>>> stemmer.stem('dancization')
u'dancize'
>>> stemmer.stem('dancation')
u'dancat'
So none of those fake words collapse onto the stem "danc", which is an impressively clean no-match result. Even allowing that 'dancer' doesn't stem to 'danc', the overall accuracy is pretty high.
I hope this helps you get started.