I'm doing some text analysis using tm_map in R. I run the following code (no errors) to produce a Document Term Matrix of (stemmed and otherwise pre-processed) words.

  corpus = Corpus(VectorSource(textVector))
  corpus = tm_map(corpus, tolower)
  corpus = tm_map(corpus, PlainTextDocument) 
  corpus = tm_map(corpus, removePunctuation)
  corpus = tm_map(corpus, removeWords, c(stopwords("english")))
  corpus = tm_map(corpus, stemDocument, language="english")

  dtm = DocumentTermMatrix(corpus)
  mostFreqTerms = findFreqTerms(dtm, lowfreq=125) 

But when I look at my (stemmed) mostFreqTerms, I see a couple that make me think, "hm, what words were stemmed to produce that?" Also, there may be stem words that make sense to me at first glance, but maybe I'm missing the fact that they actually contain words with different meanings.

I'd like to apply the technique described in the SO answer "Text-mining with the tm-package - word stemming" on retaining specific terms during stemming (e.g. keeping "natural" and "naturalized" from becoming the same stemmed term).

But to do so most comprehensively, I'd like to see a list of all the separate words that mapped to my most frequent stem words. Is there a way to find the words that, when stemmed, produced my list of mostFreqTerms?

EDIT: REPRODUCIBLE EXAMPLE

textVector = c("Trisha Takinawa: Here comes Mayor Adam West 
               himself. Mr. West do you have any words 
               for our viewers?Mayor Adam West: Box toaster
               aluminum maple syrup... no I take that one 
               back. Im gonna hold onto that one. 
               Now MaxPower is adding adamant
               so this example works")

      corpus = Corpus(VectorSource(textVector))
      corpus = tm_map(corpus, tolower)
      corpus = tm_map(corpus, PlainTextDocument) 
      corpus = tm_map(corpus, removePunctuation)
      corpus = tm_map(corpus, removeWords, c(stopwords("english")))
      corpus = tm_map(corpus, stemDocument, language="english")

      dtm = DocumentTermMatrix(corpus)
      mostFreqTerms = findFreqTerms(dtm, lowfreq=2) 
      mostFreqTerms

The code above outputs the following for mostFreqTerms:

[1] "adam" "one" "west"

I'm looking for a programmatic way to determine that the stem word "adam" came from original words "adam" and "adamant".

Max Power
  • I don't know of a way to see what particular words in your corpus are being stemmed, but you can look into the lists of equivalents on [`snowball`'s website](http://snowball.tartarus.org/texts/stemmersoverview.html). Here is [the english list](http://snowball.tartarus.org/algorithms/porter/diffs.txt) for instance. – scoa May 02 '15 at 17:43
  • It would help if you provided a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) to test with. – MrFlick May 02 '15 at 18:21
  • hm, someone wrote at the SO link below that "For example "university" and "universal" both become "univers" after stemming and there is nothing you can do to restore it correctly." http://stackoverflow.com/questions/25160521/converting-stemmed-word-to-the-root-word-in-r?rq=1 – Max Power May 02 '15 at 20:21
  • Scoa's link for "english list" appears to get me to a workable solution. For example, "adam" and "adamant" in the left column both map to "adam" in the right column. That mapping exists regardless of whether "adam" and "adamant" were both in my corpus, though. So for my mostFreqTerms list, I can't see which words in my corpus actually mapped to each stemmed word, but I can find all the words that might have mapped to them. That is enough to decide whether I should implement a process to retain certain words separately before stemming. Thanks Scoa - if you post your link as an answer I'll accept it. – Max Power May 02 '15 at 20:28
  • [This answer](http://stackoverflow.com/questions/28439522/is-it-possible-to-get-a-natural-word-after-it-has-been-stemmed/28447478#28447478) seems to be what you're looking for. – Qualtagh May 05 '15 at 03:39
  • Hey thanks for the link Qualtagh. The following suggestion from your link is actually what scoa provided with his "english list" link: "As an option: find a dictionary of all words and their forms. Find a stem for every of them. Save this projection as a map: ( stem, list of all word forms ). So you'll be able to get the list of all word forms for a given stem." – Max Power May 05 '15 at 13:11
  • There's also a reversed version of the stemming algorithm in the _update_ part of that answer. It lets you get all the words that produce a given stem without a dictionary (using a rule set). I'm not sure if that's what you need. – Qualtagh May 06 '15 at 03:47
  • What about adding an id field to each sentence? This way you'll have a link between the stemmed sentence and the original one. There's no actual way to reverse a stemmed word back to the original word (that is the purpose of stemming: to reduce the number of features). – Lior Magen Nov 22 '16 at 08:16

1 Answer

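Here you can see that the stemmed word 'west' comes from the words 'west', 'west', and 'wester'. This example uses Python's NLTK rather than tm, on a slightly modified version of the text. It keeps the unstemmed and stemmed word lists aligned by position, so each stem can be traced back to the words that produced it.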

import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import RSLPStemmer
from nltk.tokenize import word_tokenize

# Needs the NLTK tokenizer, stopword and RSLP data (installed via nltk.download).
# Note: RSLPStemmer is NLTK's Portuguese stemmer, which likely explains stems such
# as 'ad'; an English stemmer (e.g. nltk.stem.PorterStemmer) would give other stems.
st = RSLPStemmer()
punctuations = set(string.punctuation)
stop_words = set(stopwords.words('english'))
textVector = "Trisha Takinawa: Here comes Mayor adams West himself. Mr. \
            West do you have any words for our viewers?Mayor Adam Wester: \
    Box toaster aluminum maple syrup... no I take that one back. Im gonna hold \
    onto that one. Now MaxPower is adding adamant so this example works"

tokens = word_tokenize(textVector.lower())
tokens = [w for w in tokens if w not in punctuations]
filtered_words = [w for w in tokens if w not in stop_words]
# stemmed_words[i] is the stem of filtered_words[i], so the two lists stay aligned
stemmed_words = [st.stem(w) for w in filtered_words]

allWordDist = nltk.FreqDist(stemmed_words)

# For the two most frequent stems, print every original word that produced them
for stem, _count in allWordDist.most_common(2):
    for original, stemmed in zip(filtered_words, stemmed_words):
        if stemmed == stem:
            print(stem + "=" + original)

west=west
west=west
west=wester
ad=adams
ad=adam
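
If a reusable lookup is more convenient than printed pairs, the same idea can be sketched as a table from each stem to the original words that produced it, with counts. This is only an illustration under stated assumptions: `toy_stem` and `words_by_stem` are hypothetical names, and `toy_stem` is a crude suffix stripper standing in for a real stemmer so that the snippet runs without any NLTK data.

```python
from collections import Counter, defaultdict


def toy_stem(word):
    """Hypothetical stand-in stemmer: strips a few English suffixes."""
    for suffix in ("ant", "ing", "ed", "s"):
        # Only strip when at least three characters would remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def words_by_stem(tokens, stem=toy_stem):
    """Map each stem to a Counter of the original tokens that produced it."""
    mapping = defaultdict(Counter)
    for token in tokens:
        mapping[stem(token)][token] += 1
    return mapping


# Tokens as they might look after lowercasing and stopword removal
tokens = ["adam", "west", "west", "adam", "west", "adamant", "words", "adding"]
mapping = words_by_stem(tokens)

for stem_word, originals in sorted(mapping.items()):
    print(stem_word, dict(originals))
```

With a real stemmer, the `stem` argument can be replaced by any callable that takes a word and returns its stem, for example `nltk.stem.PorterStemmer().stem` (assuming NLTK is installed). Looking up a frequent stem in `mapping` then lists only the words that actually occurred in the corpus, rather than every word that could map to it.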