Here's the code I used to create the TermDocumentMatrix object for the training data:
library(tm)

# strip non-ASCII characters (emoji, etc.) before building the corpus
text_train <- iconv(data_train$SentimentText, "UTF-8", "ASCII", sub = "")
corpus_train <- Corpus(VectorSource(text_train))
tdm_train <- TermDocumentMatrix(
  corpus_train,
  control = list(
    removePunctuation = TRUE,
    stopwords = TRUE,    # note: the tm control option is "stopwords"
    stemming = FALSE,
    removeNumbers = TRUE,
    tolower = TRUE,
    weighting = weightTfIdf
  )
)
and it works! No complaints from the machine.
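For reference, this is how I confirm the training TDM built correctly (just the standard tm/base accessors, nothing fancy):

dim(tdm_train)            # rows = terms, columns = documents
head(Terms(tdm_train))    # first few terms in the vocabulary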
HOWEVER, when I use the SAME technique to build one for the validation data set, the machine complains!
Here's the code I used to create the TermDocumentMatrix object for the validation set. Notice that the ONLY difference is the added "dictionary" argument in the control list:
text_val <- iconv(data_val$SentimentText, "UTF-8", "ASCII", sub = "")
corpus_val <- Corpus(VectorSource(text_val))
tdm_val <- TermDocumentMatrix(
  corpus_val,
  control = list(
    removePunctuation = TRUE,
    stopwords = TRUE,
    stemming = FALSE,
    removeNumbers = TRUE,
    tolower = TRUE,
    weighting = weightTfIdf,
    dictionary = tdm_train$dimnames$Terms   # restrict to the training vocabulary
  )
)
However, I keep getting the following error message:
Error in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), : 'i, j, v' different lengths
I've read through many posts, including:
- DocumentTermMatrix fails with a strange error only when # terms > 3000
- Twitter Data Analysis - Error in Term Document Matrix
- twitter data <- error in termdocumentmatrix
and I tried ALL of their suggested solutions, but none of them worked.
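The closest thing to a workaround I can think of is to drop the dictionary argument entirely and align the vocabularies by hand afterwards. A rough sketch (untested at scale, and the as.matrix call densifies the whole TDM, so I doubt it's practical; I'd still like to understand the actual error):

# build the validation TDM with the same control list but WITHOUT the
# dictionary argument, then zero-fill any training terms it doesn't contain
tdm_val_full <- TermDocumentMatrix(
  corpus_val,
  control = list(removePunctuation = TRUE, stopwords = TRUE,
                 stemming = FALSE, removeNumbers = TRUE,
                 tolower = TRUE, weighting = weightTfIdf)
)
train_terms <- Terms(tdm_train)
m_val <- as.matrix(tdm_val_full)   # dense! fine for a sketch, bad for big data
m_aligned <- matrix(0, nrow = length(train_terms), ncol = ncol(m_val),
                    dimnames = list(train_terms, colnames(m_val)))
common <- intersect(train_terms, rownames(m_val))
m_aligned[common, ] <- m_val[common, ]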
One more note: the problem only occurs when I use more than roughly 2,000 tweets.
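Concretely, something like this runs clean for me, while the full text_val triggers the error (the 2000 cut-off is approximate):

# identical control list, just truncated input
text_val_small <- text_val[1:2000]
corpus_val_small <- Corpus(VectorSource(text_val_small))
tdm_val_small <- TermDocumentMatrix(
  corpus_val_small,
  control = list(removePunctuation = TRUE, stopwords = TRUE,
                 stemming = FALSE, removeNumbers = TRUE,
                 tolower = TRUE, weighting = weightTfIdf,
                 dictionary = tdm_train$dimnames$Terms)
)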
A note about the input data:
The input is a data table with two columns, one of which is named "SentimentText" (the column you see in my code above). Each row of that column is one tweet, stored as a character string. A sample row looks like this: "i had such a wonderful day today! :>"
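For concreteness, here is a tiny stand-in for the real input. The "SentimentText" column name is the real one; the second column's name and the rows themselves are made up for illustration:

library(data.table)
data_train <- data.table(
  Sentiment     = c(1L, 0L),   # hypothetical label column name
  SentimentText = c("i had such a wonderful day today! :>",
                    "stuck in traffic for two hours, ugh")
)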
Any help is much appreciated!