I am using RTextTools to build a training set with a matrix and a model which I will later apply to different documents to classify them.

EDIT: The matrix is a Document Term Matrix

The problem I am having is that sometimes with certain documents when I create the new_matrix with the following line

    new_matrix <- create_matrix(data$document, language="english",
                                removeNumbers=FALSE, removePunctuation=TRUE,
                                removeStopwords=TRUE, toLower=TRUE,
                                stemWords=TRUE, minDocFreq=1,
                                weighting=weightTfIdf, originalMatrix=matrix)

I get some NaN values, which make the corpus creation fail:

    corpus <- create_corpus(new_matrix, data$value, testSize=1:100, virgin=FALSE)

with the error

    Error in .csr.coo(x) : NA/NaN/Inf in foreign function call (arg 4)

I am not sure why there are NaN values. My guess is that it has to do with some words being present in new_matrix but not in the original matrix.

How can I replace the NaN values with 0 in the resulting matrix?

Will doing that alter the result of the classification?

Any help much appreciated! Thanks!

JordanBelf
  • Related: [R substitute NAs in a matrix](http://stackoverflow.com/q/11140650/271616). – Joshua Ulrich Jun 21 '12 at 19:49
  • Thanks Joshua, that works for a matrix but not for a document term matrix – JordanBelf Jun 21 '12 at 20:10
  • A [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) would help. Failing that, you can just look at the contents of the matrix (`str(new_matrix)`), notice it is just a list of positions and values, and remove the offending ones (`m <- new_matrix; i <- is.finite(m$v); m$i <- m$i[i]; m$j <- m$j[i]; m$v <- m$v[i]`). – Vincent Zoonekynd Jun 21 '12 at 23:11
  • Thanks Vincent! You gave me an idea. After using `str(new_matrix)` I noticed that the `NaN` values were all in `new_matrix$v`. With that, I ran the code provided by DWAHL on `new_matrix$v`, and now I can replace the `NaN` values with `0`. I have yet to understand whether that alters the results of the machine learning algorithms, but it's a good step. Thanks again! – JordanBelf Jun 21 '12 at 23:33
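
The two fixes discussed in the comments above can be sketched as follows. This is only an illustration, assuming the document term matrix is a `slam` triplet matrix (a list with `i`, `j` and `v` slots, which is what the `str()` output described above suggests). The toy matrix `m` and its values are made up and stand in for `new_matrix`.

```r
library(slam)

# Hypothetical 3 x 2 sparse matrix standing in for new_matrix;
# one of the stored values is NaN.
m <- simple_triplet_matrix(
  i = c(1L, 2L, 3L),
  j = c(1L, 2L, 2L),
  v = c(0.5, NaN, 1.2),
  nrow = 3L, ncol = 2L
)

# Option 1: overwrite the non-finite stored values with 0.
zeroed <- m
zeroed$v[!is.finite(zeroed$v)] <- 0

# Option 2: drop the offending entries (positions and values together).
keep <- is.finite(m$v)
dropped <- m
dropped$i <- dropped$i[keep]
dropped$j <- dropped$j[keep]
dropped$v <- dropped$v[keep]

# Both options give the same dense matrix, because an entry that is
# absent from the triplet representation is an implicit zero.
as.matrix(zeroed)
as.matrix(dropped)
```

Indexing the `v` slot directly is what makes this work: `is.na()` applied to the whole object did not help in this thread, whereas the stored values are an ordinary numeric vector.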

2 Answers

A simple way to find NaN values is `is.na()`, which returns TRUE for both NA and NaN:

    data <- c(1, 2, NaN, 4, 2)
    data[is.na(data)] <- 0
    data

    [1] 1 2 0 4 2
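
As a small aside, `is.na()` also matches genuine missing values (NA). If only NaN should be replaced, base R's `is.nan()` is the narrower test. A minimal sketch with a made-up vector `x`:

```r
x <- c(1, NA, NaN, 4)

na_mask  <- is.na(x)   # TRUE for both NA and NaN
nan_mask <- is.nan(x)  # TRUE only for NaN

# Replace NaN only, leaving the real missing value untouched.
y <- x
y[nan_mask] <- 0
y
```

Here `y` keeps the NA in its second position and has 0 in place of the NaN.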

DWAHL
  • Thanks, I tried that but it is not working for my matrix; the output after running it is the same. Here is my code: `new_matrix[is.na(new_matrix)] <- 0`. It seems to work fine with vectors. – JordanBelf Jun 21 '12 at 20:00
  • Correction: it seems to work with matrices but not with document term matrices. – JordanBelf Jun 21 '12 at 20:08

I'm the lead developer of RTextTools, and would really appreciate it if you could send me an example of this error. The originalMatrix parameter was introduced within the past two months, and there may be some ongoing issues with how it is processed. You can drop me an email via my website (http://www.timjurka.com/).

Timothy P. Jurka