I am new to text processing with R. I'm trying the simple code below:

library(RTextTools)
texts <- c("This is the first document.", "This is the second file.", "This is the third text.")
matrix <- create_matrix(texts, ngramLength=3)

which comes from one of the answers to the question "Finding 2 & 3 word Phrases Using R TM Package".

However, it gives this error instead: `Error in FUN(X[[2L]], ...) : non-character argument`.

I can generate a document term matrix when I drop the ngramLength parameter, but I do need to search for phrases of a certain word length. Any suggestions for alternatives or corrections?

Ricky
  • I have this problem as well. I've run a number of text cleaning packages/functions on the text to clean it and it IS character and it looks fine when I inspect it visually. – Hack-R Aug 04 '14 at 19:56
  • One solution I found online suggested using `texts <- textcnt(as.character(df))` before create_matrix, but I get the same error. I am going to try to contact the author of this package. – Hack-R Aug 05 '14 at 13:20

3 Answers


The ngramLength parameter does not seem to work. Here is a workaround:

library(RTextTools)
library(tm)
library(RWeka) # this library is needed for NGramTokenizer
texts <- c("This is the first document.", 
           "Is this a text?", 
           "This is the second file.", 
           "This is the third text.", 
           "File is not this.") 
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
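# VCorpus rather than Corpus: recent tm versions ignore custom tokenizers on the default SimpleCorpus
dtm <- DocumentTermMatrix(VCorpus(VectorSource(texts)),
                         control=list(
                                      weighting = weightTf,
                                      tokenize = TrigramTokenizer))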

as.matrix(dtm)

This uses RWeka's NGramTokenizer instead of the tokenizer called by create_matrix. You can now pass dtm to the other RTextTools functions, for example to train a classification model:

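# One class label per document, in the same order as texts
isText <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
container <- create_container(dtm, isText, virgin=FALSE, trainSize=1:3, testSize=4:5)

# Train on documents 1-3, then classify documents 4-5
models <- train_models(container, algorithms=c("SVM","BOOSTING"))
classify_models(container, models)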
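
To see what a trigram tokenizer like the one above produces without installing RWeka, here is a minimal base-R sketch. The function name `ngram_tokens` is made up for illustration, and its handling of case and punctuation (lowercasing, stripping punctuation) is an assumption, so its output may not match RWeka's NGramTokenizer exactly.

```r
# Hypothetical helper: split one string into word n-grams using base R only.
ngram_tokens <- function(text, n = 3) {
  # Lowercase and replace punctuation with spaces (an assumption, not RWeka's rules).
  cleaned <- gsub("[[:punct:]]+", " ", tolower(text))
  # Split on runs of whitespace after trimming the ends.
  words <- strsplit(trimws(cleaned), "[[:space:]]+")[[1]]
  # Fewer than n words means there is no n-gram to return.
  if (length(words) < n) return(character(0))
  # Slide a window of n words across the sentence.
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

ngram_tokens("This is the first document.")
# "this is the" "is the first" "the first document"
```

Each five-word document yields three trigrams, which become the columns of the document term matrix.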
hongsy

I ran into the same error and found a fix in this pull request: https://github.com/timjurka/RTextTools/pull/5/files. I applied the change with `trace(create_matrix, edit=TRUE)`. Now it works :)

user131476
  • This seems like the correct solution. However, that fix doesn't seem to be incorporated into the latest RTextTools on CRAN. How did you implement it? – Ricky Apr 01 '16 at 02:47
  • For now, I am managing it with trace (see https://stat.ethz.ch/R-manual/R-devel/library/base/html/trace.html). I am planning to contact the developer or build the package from source. – user131476 Apr 01 '16 at 05:08
  • I don't understand. I thought `trace` was only for debugging; do you mean `trace` can be used to replace the part of the code that requires fixing? – Ricky Apr 01 '16 at 09:11
  • Yes, trace is for debugging, but it can be used as a temporary workaround for this issue. The only downside is that a fix added with trace is lost whenever the R session is restarted, so it has to be reapplied every time R/RStudio restarts. It is an ugly workaround, but it works. – user131476 Apr 04 '16 at 08:21
  • I downloaded the source, applied this change, and installed the package from the updated source. Now it works permanently. – user131476 Apr 05 '16 at 05:58
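
To illustrate the point in the comments that a trace-based patch only lasts for the session, here is a toy sketch. It does not reproduce the change in the pull request: `split_words` is a hypothetical stand-in that fails with the same kind of "non-character argument" error, and the patch is injected through the `tracer` argument rather than `edit=TRUE` so that it runs without opening an editor.

```r
# Hypothetical stand-in for a function that fails on non-character input.
split_words <- function(x) strsplit(x, " ")[[1]]

# Helper: TRUE if evaluating the expression raises an error.
fails <- function(expr) inherits(try(expr, silent = TRUE), "try-error")

# strsplit() rejects numeric input, so the unpatched call errors.
fails_before <- fails(split_words(2016))

# Session-only patch: coerce the argument at the start of every call.
trace("split_words", tracer = quote(x <- as.character(x)), print = FALSE)
patched_result <- split_words(2016)

# untrace() (or restarting R) removes the patch and the error returns.
untrace("split_words")
fails_after <- fails(split_words(2016))
```

For the permanent route described in the last comment, one option is to install from a locally edited source directory with `install.packages("path/to/patched/RTextTools", repos = NULL, type = "source")`, where the path is a placeholder.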

I don't think this is an issue with the input data type (character). I get the same error when I use the NYTimes dataset, which is provided with the package, and run the same code that accompanies it in the help manual.

Ashish M