
I'm looking for a way to produce a numerical frequency value for single words which share a common root word/meaning.

Users will generate content using words like "dancer," "dancing," "danced."

If "dancer" is submitted 30 times, and dancing 5 times, I need just one value "dance:35" which catches all of them.

But when users also submit words like "accordance," it should not affect my "dance" count; instead, it should add to a separate count along with words like "according" and "accordingly."

Also, I don't have a pre-defined list of root words to look out for. I'll need to create it dynamically based on that user-generated content.

So my question is: what's probably the best way to pull this off? I'm sure there will be no perfect solution, but I figure someone on here can probably come up with a better way than I can.

My idea so far is to assume that most meaningful words will have at least 3 or 4 letters. So for every word I encounter whose length is greater than 4, I trim it down to 4 characters ("dancers" becomes "danc"), check my list of words to see if I've encountered it before; if so, increment its count, and if not, add it to that list. Repeat.
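In rough Python, that idea would look something like this (the 4-letter cutoff and the use of collections.Counter are just my guesses at an implementation):

from collections import Counter

def prefix_key(word, cutoff=4):
    # naive "root": truncate any word longer than the cutoff
    return word[:cutoff] if len(word) > cutoff else word

counts = Counter()
for word in ['dancer', 'dancing', 'danced', 'accordance']:
    counts[prefix_key(word.lower())] += 1

print(counts)  # Counter({'danc': 3, 'acco': 1})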

I see there are some similar questions on here, but I haven't found any answer that considers roots AND that I can implement in Python. Answers seem to be for one or the other.

monkey blot
    You assume only suffixes; what about prefixes? E.g. immature, untrue, undone, etc. Maybe you could look at http://nltk.org/ – RedBaron Oct 01 '12 at 07:25

2 Answers


What you are looking for is also known as the stem (a more technical notion than the linguistic "root") of a word. You are right in assuming that there is no perfect solution; all approaches will suffer either from imperfect analysis or from a lack of coverage. Basically, the best way to go is either a word list that contains stems, or a stemming algorithm. Check the first answer here for a Python-based solution:

How do I do word Stemming or Lemmatization?

I am using Snowball in all my Java-based projects and it works perfectly for my purposes (it is very fast, too, and covers a wide range of languages). It also seems to have a Python wrapper:

http://snowball.tartarus.org/download.php
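I haven't used the Python side myself, but a minimal sketch with the PyStemmer bindings would look something like this (assuming the PyStemmer package is installed; stemWord/stemWords are its interface):

import Stemmer  # the PyStemmer package, Python bindings for Snowball

stemmer = Stemmer.Stemmer('english')
print(stemmer.stemWord('dancing'))              # 'danc'
print(stemmer.stemWords(['danced', 'dancer']))  # ['danc', 'dancer']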

benroth

You don't need a Python wrapper for a Java lib; NLTK's got Snowball! :)

>>> from nltk.stem import SnowballStemmer as SS
>>> stemmer = SS('english')
>>> stemmer.stem('dance')
u'danc'
>>> stemmer.stem('danced')
u'danc'
>>> stemmer.stem('dancing')
u'danc'
>>> stemmer.stem('dancer')
u'dancer'
>>> stemmer.stem('accordance')
u'accord'

Stemming isn't always going to give you exact roots, but it's a great start.

Following is an example of using stems. I'm building a dictionary of stem: (word, count) while choosing the shortest word possible for each stem. So ['dancing', 'danced', 'dances', 'dance', 'dancer'] converts to {'danc': ('dance', 4), 'dancer': ('dancer', 1)}

Example Code: (text taken from http://en.wikipedia.org/wiki/Dance)

import re
from nltk.stem import SnowballStemmer as SS

text = """Dancing has evolved many styles. African dance is interpretative.
Ballet, ballroom (such as the waltz), and tango are classical styles of dance
while square dancing and the electric slide are forms of step dances.
More recently evolved are breakdancing and other forms of street dance,
often associated with hip hop culture.
Every dance, no matter what style, has something in common.
It not only involves flexibility and body movement, but also physics.
If the proper physics are not taken into consideration, injuries may occur."""
#extract words
words = [word.lower() for word in re.findall(r'\w+', text)]

stemmer = SS('english')
counts = {}

#count stems, keeping the shortest surface word seen for each
for word in words:
    stem = stemmer.stem(word)
    if stem in counts:
        shortest, count = counts[stem]
        if len(word) < len(shortest):
            shortest = word
        counts[stem] = (shortest, count + 1)
    else:
        counts[stem] = (word, 1)

#convert {key: (word, count)} to [(word, count, key)] for convenient sort and print
output = [wordcount + (root,) for root, wordcount in counts.items()]
#sort by count (descending), then word (alphabetically)
output.sort(key=lambda x: (-x[1], x[0]))
for item in output:
    print '%s:%d (Root: %s)' % item

Output:

dance:7 (Root: danc)
and:4 (Root: and)
are:4 (Root: are)
of:3 (Root: of)
style:3 (Root: style)
the:3 (Root: the)
evolved:2 (Root: evolv)
forms:2 (Root: form)
has:2 (Root: has)
not:2 (Root: not)
physics:2 (Root: physic)
african:1 (Root: african)
also:1 (Root: also)
as:1 (Root: as)
associated:1 (Root: associ)
ballet:1 (Root: ballet)
ballroom:1 (Root: ballroom)
body:1 (Root: bodi)
breakdancing:1 (Root: breakdanc)
---truncated---

I wouldn't recommend lemmatization for your specific needs:

>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> lmtzr = WordNetLemmatizer()
>>> lmtzr.lemmatize('dance')
'dance'
>>> lmtzr.lemmatize('dancer')
'dancer'
>>> lmtzr.lemmatize('dancing')
'dancing'
>>> lmtzr.lemmatize('dances')
'dance'
>>> lmtzr.lemmatize('danced')
'danced'
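Worth noting: the lemmatizer treats every word as a noun unless you tell it otherwise, which is why the verb forms above come back unchanged. Passing pos='v' does better, but that in turn requires part-of-speech tagging first:

>>> lmtzr.lemmatize('dancing', pos='v')
'dance'
>>> lmtzr.lemmatize('danced', pos='v')
'dance'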

Substrings are not a good idea, because they will always fail at some point, and often fail miserably (see the sketch after this list):

  • fixed-length: pseudo-words 'dancitization' and 'dancendence' will match 'dance' at 4 and 5 characters respectively.
  • ratio: a low similarity threshold will return fakes (like the ones above)
  • ratio: a high similarity threshold will miss legitimate variants (e.g. 'running' vs. 'run')
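Here's a rough illustration of the ratio problem using difflib.SequenceMatcher (the choice of difflib and the 0.7 threshold are just assumptions for the sake of the example):

from difflib import SequenceMatcher

def similarity(a, b):
    # 2*M / (len(a) + len(b)), where M is the number of matching characters
    return SequenceMatcher(None, a, b).ratio()

print(similarity('dance', 'dancitization'))  # ~0.44: a loose threshold lets this fake in
print(similarity('run', 'running'))          # 0.6: a strict threshold (say 0.7) rejects a real variant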

But with stemming you get:

>>> stemmer.stem('dancitization')
u'dancit'
>>> stemmer.stem('dancendence')
u'dancend'
>>> #since dancitization gives us dancit, let's try dancization to get danc
>>> stemmer.stem('dancization')
u'dancize'
>>> stemmer.stem('dancation')
u'dancat'

That's an impressive no-match result for the stem "danc". And even considering that 'dancer' does not stem to 'danc', the overall accuracy is pretty high.

I hope this helps you get started.

Anuj Gupta