
I am trying to write a function that returns the stem map of the words in a text after Porter stemming. When I ran it on an example, the code never finished, i.e. no output appeared. There was no error, but when I force-stopped it, I got warnings like:

1: In stemList[length(stemList) + 1][2] <- flatText[i] :
  number of items to replace is not a multiple of replacement length
2: In stemList[length(stemList) + 1][2] <- flatText[i] :
  number of items to replace is not a multiple of replacement length
3: In stemList[length(stemList) + 1][2] <- flatText[i] :
  number of items to replace is not a multiple of replacement length
4: In stemList[length(stemList) + 1][2] <- flatText[i] :
  number of items to replace is not a multiple of replacement length
5: In stemList[length(stemList) + 1][2] <- flatText[i] :
  number of items to replace is not a multiple of replacement length
6: In stemList[length(stemList) + 1][2] <- flatText[i] :
  number of items to replace is not a multiple of replacement length
7: In stemList[length(stemList) + 1][2] <- flatText[i] :
  number of items to replace is not a multiple of replacement length
8: In stemList[length(stemList) + 1][2] <- flatText[i] :
  number of items to replace is not a multiple of replacement length
9: In stemList[length(stemList) + 1][2] <- flatText[i] :
  number of items to replace is not a multiple of replacement length

My code is as follows:

stemMAP<-function(text){
  flatText<-unlist(strsplit(text," "))
  textLength<-length(flatText)

  stemList<-list(NULL)
  for(i in 1:textLength){
    wordStem<-SnowballStemmer(flatText[i])
    flagStem=0
    flagWord=0

    for(j in 1:length(stemList)){
      if(regexpr(wordStem,stemList[j][1])==TRUE){

        for(k in 1:length(stemList[j])){
          if(regexpr(flatText[i],stemList[j][k])==TRUE){ 
            flagWord=1
            #break;
            }
         }

        if(flagWord==0){
          stemList[j][length(stemList[j])+1]<-flatText[i]
          #break;
        }

        flagStem=1

      }

      if(flagStem==0){
        stemList[length(stemList)+1][1]<-wordStem
        stemList[length(stemList)+1][2]<-flatText[i]
      }

    }

  }

  return(stemList)
}

How can I identify the mistakes? My test statement was:

stem<-stemMAP("I like being active and playing because when you play it activates your body and this activation leads to a good health")
jackStinger
  • Your code is not reproducible. We don't have SnowballStemmer. Try setting options(warn=2) and re-running; this will turn your warnings into errors. – agstudy Dec 27 '12 at 09:50
  • It needs the package Snowball; the dependencies are tm, RWeka, rJava. Also, warn didn't work; it gives an unused args() error. – jackStinger Dec 27 '12 at 10:05
  • Your code is confusing. With stemList<-list(NULL) you put a NULL in the first element. Why? Maybe you want stemList <- NULL? The first time stemList[length(stemList)+1][1]<-wordStem is evaluated, it becomes stemList[2][1], which makes no sense for a list. What do you expect to have in stemList? – agstudy Dec 27 '12 at 10:17
  • http://stackoverflow.com/questions/4442518/general-suggestions-for-debugging-r/5156351#5156351 – Ari B. Friedman Dec 27 '12 at 10:24
  • @agstudy I tried it with list(); it didn't work. I figured the reason could be that the length is 0, then changed for(j in 1:length(stemList)) to for(j in 1:length(stemList)+1) to avoid it. No good. – jackStinger Dec 27 '12 at 10:38
  • @jackStinger What do you expect to have in stemList? – agstudy Dec 27 '12 at 10:49
  • @agstudy stemList is a list of lists. Each sub-list has the stem as its first element and the words that map to that stem as its subsequent elements. For example, I'd like [[1]][[1]]activ [[1]][[2]]active [[1]][[3]]activates [[1]][[4]]activation [[2]][[1]]play [[2]][[2]]playing [[2]][[1]]play etc. I left out the unimportant words, but the algorithm need not. – jackStinger Dec 27 '12 at 10:53
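
The indexing problem raised in the comments can be seen in isolation. The following is a minimal base R sketch, not a fix of the original function; the `stems` and `x` variables and their contents are made up for illustration. On a list, single brackets return a sub-list while double brackets return the element itself, so `stemList[j][1]` is just `stemList[j]` again, and assigning through `[k][2]` is what appears to trigger the warning shown in the question.

```r
# Illustrative data only (hand-written, not produced by SnowballStemmer)
stems <- list(c("activ", "active"), c("play", "playing"))

class(stems[1])                    # "list": a sub-list of length 1
class(stems[[1]])                  # "character": the vector itself
identical(stems[1][1], stems[1])   # TRUE: [1] on a length-1 list changes nothing

# Same pattern as stemList[length(stemList) + 1][2] <- flatText[i]:
# a two-element value is pushed into a single slot, so R warns
x <- list(NULL)
x[length(x) + 1][2] <- "word"

# Appending a word to an existing entry needs [[ ]]
stems[[1]] <- c(stems[[1]], "activates")

# Appending a new entry: one assignment with the complete element
stems[[length(stems) + 1]] <- c("be", "being")

# seq_along() is safer than 1:length(x), which yields c(1, 0) for an empty list
for (j in seq_along(stems)) print(stems[[j]])
```

Note also that in the question's function the "append a new stem" block sits inside the `for (j ...)` loop, so it can run once per existing element rather than once per word. That makes the list grow very quickly and is a plausible reason the call never seemed to finish.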

1 Answer


Here I rewrite your code using the vectorized version of SnowballStemmer, so there is no need for a for loop.

library(Snowball)  # provides SnowballStemmer (package named in the question's comments)
library(plyr)      # provides dlply
stemMAP <- function(text) {
  flatText <- unlist(strsplit(text, " "))
  ## SnowballStemmer is vectorized, so all words are stemmed in one call
  wordStem <- as.character(SnowballStemmer(flatText))
  hh <- data.frame(ff = flatText, sn = wordStem)
  ## Use plyr to transform the result into a list.
  ## dlply: data.frame to list apply.
  ## We group hh by the column sn and apply the function
  ## as.character(x$ff) to each group (x here is the subset data.frame)
  stemList <- dlply(hh, .(sn), function(x) as.character(x$ff))
  stemList
}

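stemList <- stemMAP("I like being active and playing because when you play it activates your body and this activation leads to a good health")
stemList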
$I
[1] "I"

$a
[1] "a"

$activ
[1] "active"     "activates"  "activation"

$and
[1] "and" "and"

$be
[1] "being"
agstudy
  • It worked excellently. I just need to remove the attributes split_type & labels. I extracted the stems using names(), so that works great too. However, I have a very limited understanding of what you did here, especially dlply(). It'd be a great favor if you could explain it to me! Thanks anyway, you were of great help! – jackStinger Dec 27 '12 at 12:20
  • @jackStinger I added some explanation. For more info, see ?ddply and ?dlply in the plyr package. – agstudy Dec 27 '12 at 12:25
  • Ah, thanks for the explanation. It made things much clearer. In simple words, this just rolled up the data, right? Thanks! – jackStinger Dec 27 '12 at 12:29
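
The "roll up" described in the last comment can also be reproduced in base R with `split()`. This is only a sketch: the `stems` vector below is typed by hand as an assumption for illustration, where the answer's function would obtain it from SnowballStemmer.

```r
words <- c("active", "playing", "activates", "play", "activation", "play")
# Hand-written stems, assumed for illustration (not computed by a stemmer)
stems <- c("activ", "play", "activ", "play", "activ", "play")

# split() groups the first vector by the values of the second,
# giving one named list element per distinct stem
stemList <- split(words, stems)
names(stemList)    # the stems
stemList$activ     # the words that share the stem "activ"

# Optionally drop repeated words within each group
stemList <- lapply(stemList, unique)
```

Unlike the `dlply()` result, the list returned by `split()` carries no split_type or labels attributes, so there is nothing to strip afterwards.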