Here's the code I used to create the TermDocumentMatrix object for the training data:

text_train = iconv(data_train$SentimentText, "UTF-8", "ASCII", sub = "")
corpus_train = Corpus(VectorSource(text_train))
tdm_train = TermDocumentMatrix(
  corpus_train,
  control = list(
    removePunctuation = TRUE,
    removestopWords   = TRUE,
    stemming = FALSE,
    removeNumbers = TRUE, 
    tolower = TRUE,
    weighting = weightTfIdf)
)

And it works! No complaints from the machine.

HOWEVER, when I use the SAME technique to create one for the validation data set, the machine complains!

Here's the code I used to create the TermDocumentMatrix object for the validation set. Notice that the ONLY difference is that I added the "dictionary" argument to the control list:

text_val = iconv(data_val$SentimentText, "UTF-8", "ASCII", sub = "")
corpus_val = Corpus(VectorSource(text_val))
tdm_val = TermDocumentMatrix(
  corpus_val,
  control = list(
    removePunctuation = TRUE,
    removestopWords   = TRUE,
    stemming = FALSE,
    removeNumbers = TRUE, 
    tolower = TRUE,
    weighting = weightTfIdf,
    dictionary = tdm_train$dimnames$Terms
  )
)

However, I keep getting the following error message:

Error in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), : 'i, j, v' different lengths

I've read through many posts, including:

  1. DocumentTermMatrix fails with a strange error only when # terms > 3000
  2. Twitter Data Analysis - Error in Term Document Matrix
  3. twitter data <- error in termdocumentmatrix
  4. Twitter Data Analysis - Error in Term Document Matrix

and I tried ALL of their suggested solutions, but none of them works.

One note I'd like to add: the problem only occurs when I use more than about 2000 tweets.
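
Since the failure seems to depend on the number of tweets, one way to narrow it down to a smaller reproducible sample might be a bisection over the input size. This is only a sketch: `first_failing_size`, `build`, `toy_build` and `toy_texts` are hypothetical names, `build` stands for any wrapper around the TermDocumentMatrix call that throws an error on failure, and the search assumes that once the first n tweets fail, every longer prefix fails too.

```r
# Sketch: find the smallest number of leading tweets for which `build` fails.
# Assumption: failures are monotone in the prefix length, and an empty
# prefix counts as a success without being tested.
first_failing_size <- function(texts, build) {
  fails <- function(n) {
    tryCatch({
      build(texts[seq_len(n)])
      FALSE
    }, error = function(e) TRUE)
  }
  if (!fails(length(texts))) return(NA_integer_)  # the full input works
  lo <- 0L             # largest size known (or assumed) to succeed
  hi <- length(texts)  # smallest size known to fail
  while (hi - lo > 1L) {
    mid <- (lo + hi) %/% 2L
    if (fails(mid)) hi <- mid else lo <- mid
  }
  hi
}

# Toy stand-in for the real build step: it fails once a "BAD" tweet is included.
toy_build <- function(x) if (any(x == "BAD")) stop("boom") else length(x)
toy_texts <- c(rep("ok", 6), "BAD", rep("ok", 3))
first_failing_size(toy_texts, toy_build)  # returns 7
```

With the real data, `build` would wrap the corpus and TermDocumentMatrix steps shown above, and the returned size would point at the tweet whose inclusion first triggers the error.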

A note about the input data:

The input data is a data table with two columns, one of which is named "SentimentText" (the one you see in my code above).

In this column, each row is one tweet, and each tweet is a text string, i.e. of type character.

A sample tweet, i.e. one row of data, looks like this: "i had such a wonderful day today! :>"
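
For anyone trying to reproduce this, a minimal sketch of the input shape described above might look like the following. Only the first tweet is the sample quoted above. The second tweet, the name `data_example`, and the name of the second column (`Sentiment`) are made up for illustration.

```r
# Hypothetical stand-in for data_train / data_val: one tweet per row,
# stored as character in a column named "SentimentText".
data_example <- data.frame(
  Sentiment     = c(1, 0),  # assumed name and content of the second column
  SentimentText = c("i had such a wonderful day today! :>",
                    "the caf\u00e9 was closed again"),  # made-up tweet with a non-ASCII character
  stringsAsFactors = FALSE
)

# Same conversion as in the code above: characters that cannot be
# represented in ASCII are dropped because sub = "".
text_example <- iconv(data_example$SentimentText, "UTF-8", "ASCII", sub = "")

str(data_example)
print(text_example)
```

The resulting `text_example` is a plain character vector, which is what `VectorSource()` receives in the code above.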

Any help is much appreciated!

alwaysaskingquestions
  • You could have a look here: http://stackoverflow.com/questions/21790353/dictionary-is-not-supported-anymore-in-tm-package-how-to-emend-code Best – adrian1121 May 10 '16 at 06:58
  • Hi @adrian1121, thank you for the link! But the "dictionary" mentioned in the answers there is not the same dictionary that I'm talking about here... – alwaysaskingquestions May 10 '16 at 07:01
  • Would be helpful if you could provide some data to reproduce the error. – lukeA May 10 '16 at 07:07
  • @lukeA I don't think I can provide sample data unless I give you the 3-5k tweets in CSV format. Would you want that? :) Because, like I mentioned, this problem does not happen with a small data set... – alwaysaskingquestions May 10 '16 at 17:50

0 Answers