
When I try to apply stemCompletion to a corpus, the function generates NA values.

This is my code:

my.corpus <- tm_map(my.corpus, removePunctuation) 
my.corpus <- tm_map(my.corpus, removeWords, stopwords("english")) 

(One resulting document is: [[2584]] zoning plan)

The next step is stemming the corpus:

my.corpus <- tm_map(my.corpus, stemDocument, language="english")
my.corpus <- tm_map(my.corpus, stemCompletion, dictionary=my.corpus_copy, type="first")

But the result is this:

[[2584]] NA plant

The next step should be to create an incidence matrix of transactions and then mine apriori rules, but when I go on and try to get the rules, inspect(rules) gives me this error:

> inspect(rules)
Error in UseMethod("inspect", x) : 
no applicable method for 'inspect' applied to an object of class "c('rules','associations')"

What is the problem? I suppose the NA values prevent the incidence matrix, and therefore the rules, from being generated correctly. Is that the problem? If so, how can I solve it?

Here is a minimal example of the problem:

my.words = c("β cell","zoning policy regional index brazil","zoning plan","zolpidem  adult","zizyphus spinosa hu")
my.corpus = Corpus(VectorSource(my.words))
my.corpus_copy = my.corpus
my.corpus = tm_map(my.corpus, removePunctuation)
my.corpus = tm_map(my.corpus, removeWords, c("the", stopwords("english"))) 
my.corpus = tm_map(my.corpus, stemDocument, language="english")
my.corpus <- tm_map(my.corpus, stemCompletion, dictionary=my.corpus_copy, type="first")
inspect(my.corpus)
ntrax
  • Could you kindly provide a reproducible example which we can copy/paste into R and run, please? – Tony Breyal Sep 13 '13 at 09:14
  • I have added a minimal example of the code to the main post – ntrax Sep 13 '13 at 09:33
  • How about using the corpus itself rather than its unmodified copy? This works for me in terms of removing the NA at least (not really an answer but at least it's something till someone comes up with something better): tm_map(my.corpus, stemCompletion, dictionary=my.corpus, type="first") – Tony Breyal Sep 13 '13 at 09:39
  • Thanks for the help, this solves the NA problem! But I still have a problem with inspect(): I can't call inspect() on incidence.matrix (no applicable method for 'inspect' applied to an object of class "c('matrix', 'double', 'numeric')"), and the same happens with inspect(rules) – ntrax Sep 13 '13 at 11:06
  • Do you have the code leading up to the inspect() call? I don't know how else to reproduce your error. – Tony Breyal Sep 15 '13 at 10:37

1 Answer


At the moment, stemCompletion() is only an approximate reversal of the stemming process, even when the original corpus is used as the dictionary parameter. Using grep(), it searches the dictionary for all words that start with the current stemmed word and then picks one of them for completion according to 'type'.

Thus it fails whenever stemming returns a word that is not a prefix of the un-stemmed word. For example, the stems of c('delivery', 'zoning') are c('deliveri', 'zone'), as returned by wordStem(), which stemDocument() uses. In both cases the stemmed word is not a prefix of the un-stemmed word, so stemCompletion() finds no replacement and returns NA.
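The matching behaviour described above can be illustrated in base R without loading tm. This is only a sketch of the prefix search, not the package's actual implementation; the dictionary and stems vectors are made up for the illustration, and picking the first hit stands in for type="first".

```r
# Illustrative data (assumed for this sketch, not taken from a real corpus)
dictionary <- c("zoning", "plan", "delivery", "plant")
stems      <- c("zone", "plan", "deliveri")

# For each stem, look for dictionary words that start with it;
# take the first hit, or NA when nothing matches
completions <- vapply(stems, function(w) {
  hits <- grep(paste0("^", w), dictionary, value = TRUE)
  if (length(hits) == 0) NA_character_ else hits[1]
}, character(1), USE.NAMES = FALSE)

print(completions)
```

Here "zone" and "deliveri" find no dictionary word beginning with them, so only "plan" gets a completion.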

There are several ways around this problem, including replacing the NAs with the stemmed words after stemCompletion() returns or, better, modifying the stemCompletion() function itself. A simple modification that retains the stemmed word instead of NA is to write your own version, stemCompletion_modified() (replace each ... with the original code from the stemCompletion() function in the tm package):

stemCompletion_modified <- function (x, dictionary, type = ...) 
{
  ...
  # Original line:
  # possibleCompletions <- lapply(x, function(w) grep(sprintf("^%s", w), dictionary, value = TRUE))
  # Modified: keep the stem itself when no dictionary word starts with it
  possibleCompletions <- lapply(x, function(w) {
    matches <- grep(sprintf("^%s", w), dictionary, value = TRUE)
    if (length(matches) == 0) w else matches
  })
  ...
}
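The other alternative mentioned above, replacing the NAs after completion has run, could be sketched roughly as follows. fill_na_with_stem() is a hypothetical helper, not part of tm, and the completed vector is hard-coded here to stand in for whatever stemCompletion() returned for the given stems.

```r
# Hypothetical helper (not part of tm): wherever completion produced NA,
# fall back to the stem that was passed in
fill_na_with_stem <- function(completed, stems) {
  stopifnot(length(completed) == length(stems))
  ifelse(is.na(completed), stems, completed)
}

stems     <- c("zone", "plan", "deliveri")
completed <- c(NA, "plan", NA)  # assumed result of a prefix-based completion

repaired <- fill_na_with_stem(completed, stems)
print(repaired)
```

This leaves successful completions untouched and keeps the raw stem elsewhere, so later steps never see NA tokens.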