
I am using the tm package for text analysis of repair data. I read the data into a data frame, converted it to a Corpus object, and cleaned it with tolower, stripWhitespace, stop word removal and so on.

I kept a backup copy of the Corpus object to use as the dictionary for stemCompletion.

I then ran stemDocument through tm_map; the words in my object were stemmed and the results were as expected.

When I run stemCompletion through tm_map, however, it does not work and I get the error below:

Error in UseMethod("words") : no applicable method for 'words' applied to an object of class "character"

Running traceback() shows the following steps:

> traceback()
9: FUN(X[[1L]], ...)
8: lapply(dictionary, words)
7: unlist(lapply(dictionary, words))
6: unique(unlist(lapply(dictionary, words)))
5: FUN(X[[1L]], ...)
4: lapply(X, FUN, ...)
3: mclapply(content(x), FUN, ...)
2: tm_map.VCorpus(c, stemCompletion, dictionary = c_orig)
1: tm_map(c, stemCompletion, dictionary = c_orig)

How can I resolve this error?

eddie_cat
Sunil

4 Answers


I received the same error when using tm v0.6. I suspect this occurs because stemCompletion is not in the default transformations for this version of the tm package:

>  getTransformations
function () 
c("removeNumbers", "removePunctuation", "removeWords", "stemDocument", 
    "stripWhitespace")
<environment: namespace:tm>

Now, the tolower function has the same problem, but can be made operational by using the content_transformer function. I tried a similar approach for stemCompletion but was not successful.

Note, even though stemCompletion isn't a default transformation, it still works when manually fed stemmed words:

> stemCompletion("compani",dictCorpus)
    compani 
"companies" 

So that I could continue with my work, I manually split each document in the corpus on single spaces, fed the words through stemCompletion, and concatenated them back together with the following (clunky and not graceful!) function:

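stemCompletion_mod <- function(x, dict = dictCorpus) {
  # Split the document into words, complete each stem, then rejoin.
  words <- unlist(strsplit(as.character(x), " "))
  completed <- stemCompletion(words, dictionary = dict, type = "shortest")
  PlainTextDocument(stripWhitespace(paste(completed, collapse = " ")))
}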

where dictCorpus is just a copy of the cleaned corpus, but before it's stemmed. The extra stripWhitespace is specific for my corpus, but is likely benign for a general corpus. You may want to change the type option from "shortest" as needed.


For a full example, let's set up a dummy corpus using the crude data in the tm package:

> data("crude")
> docs = Corpus(VectorSource(crude))
> docs <- tm_map(docs, content_transformer(tolower))
> docs <- tm_map(docs, removeNumbers)
> docs <- tm_map(docs, removeWords, stopwords("english"))
> docs <- tm_map(docs, removePunctuation)
> docs <- tm_map(docs, stripWhitespace)
> docs <- tm_map(docs, PlainTextDocument)
> dictCorpus <- docs
> docs <- tm_map(docs, stemDocument)

> # Define modified stemCompletion function
> stemCompletion_mod <- function(x,dict=dictCorpus) {
  PlainTextDocument(stripWhitespace(paste(stemCompletion(unlist(strsplit(as.character(x)," ")),dictionary=dict, type="shortest"),sep="", collapse=" ")))
}

> # Original doc in crude data
> crude[[1]]
<<PlainTextDocument (metadata: 15)>>
Diamond Shamrock Corp said that
effective today it had cut its contract prices for crude oil by
1.50 dlrs a barrel.
    The reduction brings its posted price for West Texas
Intermediate to 16.00 dlrs a barrel, the copany said.
    "The price reduction today was made in the light of falling
oil product prices and a weak crude oil market," a company
spokeswoman said.
    Diamond is the latest in a line of U.S. oil companies that
have cut its contract, or posted, prices over the last two days
citing weak oil markets.
 Reuter

> # Stemmed example in crude data
> docs[[1]]
<<PlainTextDocument (metadata: 7)>>
diamond shamrock corp said effect today cut contract price crude oil dlrs barrel 
reduct bring post price west texa intermedi dlrs barrel copani said price reduct today 
made light fall oil product price weak crude oil market compani spokeswoman said diamond 
latest line us oil compani cut contract post price last two day cite weak oil market reuter

> # Stem completed example in crude data
> stemCompletion_mod(docs[[1]],dictCorpus)
<<PlainTextDocument (metadata: 7)>>
diamond shamrock corp said effect today cut contract price crude oil dlrs barrel 
reduction brings posted price west texas intermediate dlrs barrel NA said price reduction today 
made light fall oil product price weak crude oil market companies spokeswoman said diamond 
latest line us oil companies cut contract posted price last two day cited weak oil market reuter

Note: This example is odd: the misspelled word "copany" is stemmed to "copani", which stemCompletion then returns as NA. I am not sure how to correct this.
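One plausible explanation for the NA, sketched in base R. It assumes that stemCompletion looks up dictionary words beginning with the stem, in the style of the `grep(sprintf("^%s", w), dictionary, value = TRUE)` call that shows up in the error message quoted in the comments. The three-word dictionary and the helper `complete_candidates` are made up for illustration and are not part of tm.

```r
# Hypothetical mini-dictionary standing in for the words of dictCorpus.
dictionary <- c("company", "companies", "copany")

# Prefix lookup in the style of the grep() call shown in the error message.
# complete_candidates is an illustrative helper, not a tm function.
complete_candidates <- function(stem, dictionary) {
  grep(sprintf("^%s", stem), dictionary, value = TRUE)
}

# "compani" is a prefix of "companies", so a candidate is found.
hits_compani <- complete_candidates("compani", dictionary)

# "copany" does not start with "copani", so nothing is found.
hits_copani <- complete_candidates("copani", dictionary)

print(hits_compani)
print(hits_copani)
```

If this is indeed the mechanism, a stem ending in "i" can only be completed when the dictionary holds a form such as "companies"; a word that occurs only with its "y" ending has no candidate, which would explain the NA.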

To run the stemCompletion_mod through the entire corpus, I just use sapply (or parSapply with snow package).

Perhaps someone with more experience than me could suggest a simpler modification to get stemCompletion to work in v0.6 of the tm package.

cdxsza
  • Hi! I know this is old but I just stumbled upon it, and am getting an error when I run stemCompletion_mod (Error in grep(sprintf("^%s", w), dictionary, value = TRUE) : invalid regular expression, reason 'Missing ')'') Can you explain more how to use sapply or parSapply? Thanks! – Sammie Apr 07 '20 at 08:58

I had success with the following workflow:

  1. use content_transformer to apply an anonymous function on each document of the corpus,
  2. split the document to words by spaces,
  3. call stemCompletion on the words with the help of the dictionary,
  4. and concatenate the separate words into a document again with paste.

POC demo code:

tm_map(c, content_transformer(function(x, d)
  paste(stemCompletion(strsplit(stemDocument(x), ' ')[[1]], d), collapse = ' ')), d)

PS: using c as a variable name to store the corpus is not a good idea due to base::c
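The split, complete and rejoin steps above can also be traced without tm. This is only a sketch: `complete_stem` is a hypothetical stand-in for stemCompletion (it picks the first dictionary word that starts with the stem and keeps the stem otherwise), and the dictionary is a made-up character vector.

```r
# Stand-in for tm::stemCompletion: first dictionary word starting with the
# stem, or the stem itself when nothing matches. Hypothetical helper.
complete_stem <- function(stem, dictionary) {
  hits <- dictionary[startsWith(dictionary, stem)]
  if (length(hits) > 0) hits[1] else stem
}

complete_document <- function(text, dictionary) {
  # Split the document into words on single spaces.
  stems <- strsplit(text, " ", fixed = TRUE)[[1]]
  # Drop empty tokens produced by leading or repeated spaces.
  stems <- stems[stems != ""]
  # Complete each stem against the dictionary.
  completed <- vapply(stems, complete_stem, character(1),
                      dictionary = dictionary, USE.NAMES = FALSE)
  # Concatenate the words into one document again.
  paste(completed, collapse = " ")
}

dictionary <- c("reduction", "brings", "posted", "price")
result <- complete_document("reduct bring post price", dictionary)
print(result)
```

Inside tm_map the same function body would be wrapped in content_transformer, as in the demo code above, so that each document keeps its class and metadata.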

daroczig

Thanks, cdxsza. Your method worked for me.

A note to all who are going to use stemCompletion:

The function completes an empty string with a word in dictionary, which is unexpected. See an example below, where the first "monday" was produced for the blank at the beginning of the string.

stemCompletion(unlist(strsplit(" mond tues ", " ")), dict=c("monday", "tuesday"))


[1]   "monday"  "monday" "tuesday" 
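A likely reason, shown in base R only. The sketch assumes that stemCompletion builds its lookup pattern as `sprintf("^%s", w)`, which is the call that appears in the grep error messages quoted elsewhere on this page; the variable names are illustrative.

```r
dictionary <- c("monday", "tuesday")

# Splitting a string with a leading space produces an empty first token.
tokens <- unlist(strsplit(" mond tues ", " "))
print(tokens)

# For the empty token the pattern is just "^", which matches every entry,
# so every dictionary word becomes a candidate completion.
pattern_for_empty <- sprintf("^%s", tokens[1])
hits_for_empty <- grep(pattern_for_empty, dictionary, value = TRUE)
print(hits_for_empty)
```

With every dictionary word as a candidate, one of them gets chosen for the empty token, here "monday".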

It can be easily fixed by removing empty string "" before stemCompletion as below.

stemCompletion2 <- function(x, dictionary) {
  x <- unlist(strsplit(as.character(x), " "))
  x <- x[x != ""]
  x <- stemCompletion(x, dictionary = dictionary)
  x <- paste(x, sep = "", collapse = " ")
  PlainTextDocument(stripWhitespace(x))
}

myCorpus <- lapply(myCorpus, stemCompletion2, dictionary = myCorpusCopy)
myCorpus <- Corpus(VectorSource(myCorpus))

See a detailed example on page 12 of the slides at http://www.rdatamining.com/docs/RDataMining-slides-text-mining.pdf

Regards

Yanchang Zhao

RdataMining.com

Robert
Yanchang Zhao
  • This works partially. Sometimes it is removing some of the stemmed terms which is not acceptable. – CodeMonkey Jul 26 '16 at 20:04
  • it gives me this error `Error in grep(sprintf("^%s", w), dictionary, value = TRUE) : invalid regular expression '^list(sec', reason 'Missing ')''` – Hardik Gupta Jan 16 '17 at 08:40
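The regular expression error in the comment above can be reproduced in base R. The pattern `'^list(sec'` suggests the stem handed to grep began with `list(`, possibly because as.character() was applied to a list-like document rather than to its text; that part is a guess. What the sketch does show is that any stem containing an unbalanced "(" breaks a pattern built with sprintf, and that literal prefix matching with startsWith does not have this problem. The dictionary here is made up.

```r
dictionary <- c("section", "second")
stem <- "list(sec"   # stem containing an unbalanced "(", as in the error above

# Building a regular expression from the raw stem fails to compile.
regex_result <- tryCatch(
  suppressWarnings(grep(sprintf("^%s", stem), dictionary, value = TRUE)),
  error = function(e) e
)
print(inherits(regex_result, "error"))

# Literal prefix matching does not interpret the stem as a pattern.
literal_result <- dictionary[startsWith(dictionary, stem)]
print(literal_result)
```

Checking what `as.character(x)` returns for a single document before splitting it is probably the quickest way to see whether this applies to a given corpus.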

The problem is that using tolower (e.g. myCorpus <- tm_map(myCorpus, tolower)) converts the text to simple character values, which tm version 0.6 does not accept for use with tm_map.

If you instead do your original tolower like this

myCorpus <- tm_map(myCorpus, content_transformer(tolower))

then the data will be in the correct format for when you need stemCompletion.

Other functions like removePunctuation and removeNumbers are used with tm_map as usual, i.e. without content_transformer.
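Why a bare character value triggers the "no applicable method" error from the question can be illustrated with a toy S3 generic in base R. Everything here is hypothetical: `words_demo` and the class `demo_document` only mimic how a generic such as words() dispatches on the class of its argument and are not tm code.

```r
# Hypothetical generic that dispatches on the class of its argument.
words_demo <- function(x) UseMethod("words_demo")

# Method for a document-like object (a list with a content field).
words_demo.demo_document <- function(x) {
  unlist(strsplit(x$content, " ", fixed = TRUE))
}

doc <- structure(list(content = "crude oil prices"), class = "demo_document")
plain <- "crude oil prices"   # what a bare tolower() would leave behind

# Dispatch succeeds for the document-like object.
doc_words <- words_demo(doc)

# Dispatch fails for a plain character value: no method is defined for it.
plain_error <- tryCatch(words_demo(plain), error = function(e) conditionMessage(e))

print(doc_words)
print(plain_error)
```

content_transformer avoids this situation by applying the function to the text inside each document and returning a document, so the dictionary corpus still contains objects that the generic can handle.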

Reference: https://stackoverflow.com/a/24771621
