Here's the code I used to create the TermDocumentMatrix object for the training data:
library(tm)

# strip non-ASCII characters (emoji, etc.) before building the corpus
text_train <- iconv(data_train$SentimentText, "UTF-8", "ASCII", sub = "")
corpus_train <- Corpus(VectorSource(text_train))
tdm_train <- TermDocumentMatrix(
  corpus_train,
  control = list(
    removePunctuation = TRUE,
    stopwords = TRUE,    # note: the tm control option is "stopwords"
    stemming = FALSE,
    removeNumbers = TRUE,
    tolower = TRUE,
    weighting = weightTfIdf
  )
)
and it works! No complaints from the machine.
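For reference, this is how I confirm the training TDM built correctly (just the standard tm/base accessors, nothing fancy):

dim(tdm_train)            # rows = terms, columns = documents
head(Terms(tdm_train))    # first few terms in the vocabulary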
HOWEVER, when I use the SAME technique to build one for the validation data set, the machine complains!
Here's the code I used to create the TermDocumentMatrix object for the validation set. Notice that the ONLY difference is the added "dictionary" argument in the control list:
text_val <- iconv(data_val$SentimentText, "UTF-8", "ASCII", sub = "")
corpus_val <- Corpus(VectorSource(text_val))
tdm_val <- TermDocumentMatrix(
  corpus_val,
  control = list(
    removePunctuation = TRUE,
    stopwords = TRUE,
    stemming = FALSE,
    removeNumbers = TRUE,
    tolower = TRUE,
    weighting = weightTfIdf,
    dictionary = tdm_train$dimnames$Terms   # restrict to the training vocabulary
  )
)
However, I keep getting the following error message:
Error in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), : 'i, j, v' different lengths
I've read through many posts, including:
- DocumentTermMatrix fails with a strange error only when # terms > 3000
- Twitter Data Analysis - Error in Term Document Matrix
- twitter data <- error in termdocumentmatrix
and I tried ALL of their suggested solutions, but none of them worked.
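The closest thing to a workaround I can think of is to drop the dictionary argument entirely and align the vocabularies by hand afterwards. A rough sketch (untested at scale, and the as.matrix call densifies the whole TDM, so I doubt it's practical; I'd still like to understand the actual error):

# build the validation TDM with the same control list but WITHOUT the
# dictionary argument, then zero-fill any training terms it doesn't contain
tdm_val_full <- TermDocumentMatrix(
  corpus_val,
  control = list(removePunctuation = TRUE, stopwords = TRUE,
                 stemming = FALSE, removeNumbers = TRUE,
                 tolower = TRUE, weighting = weightTfIdf)
)
train_terms <- Terms(tdm_train)
m_val <- as.matrix(tdm_val_full)   # dense! fine for a sketch, bad for big data
m_aligned <- matrix(0, nrow = length(train_terms), ncol = ncol(m_val),
                    dimnames = list(train_terms, colnames(m_val)))
common <- intersect(train_terms, rownames(m_val))
m_aligned[common, ] <- m_val[common, ]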
One more note: the problem only occurs when I use more than roughly 2,000 tweets.
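Concretely, something like this runs clean for me, while the full text_val triggers the error (the 2000 cut-off is approximate):

# identical control list, just truncated input
text_val_small <- text_val[1:2000]
corpus_val_small <- Corpus(VectorSource(text_val_small))
tdm_val_small <- TermDocumentMatrix(
  corpus_val_small,
  control = list(removePunctuation = TRUE, stopwords = TRUE,
                 stemming = FALSE, removeNumbers = TRUE,
                 tolower = TRUE, weighting = weightTfIdf,
                 dictionary = tdm_train$dimnames$Terms)
)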
A note about the input data:
The input is a data table with two columns, one of which is named "SentimentText" (the column you see in my code above). Each row of that column is one tweet, stored as a character string. A sample row looks like this: "i had such a wonderful day today! :>"
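For concreteness, here is a tiny stand-in for the real input. The "SentimentText" column name is the real one; the second column's name and the rows themselves are made up for illustration:

library(data.table)
data_train <- data.table(
  Sentiment     = c(1L, 0L),   # hypothetical label column name
  SentimentText = c("i had such a wonderful day today! :>",
                    "stuck in traffic for two hours, ugh")
)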
Any help is much appreciated!