I have a matrix with 99,814 items containing reviews and their respective polarities (positive or negative), and I want to do some feature selection over the terms of the corpus, keeping only those that are most informative for identifying each polarity, before I pass it to a model.
The problem is that I am currently working with 16,554 terms, so coercing the sparse document-term matrix into a dense one so I can apply something like a chi-squared test to the terms fails with a "Cholmod error 'out of memory'" message.
So my question is: is there any feasible way to get the chi-squared value of every term while keeping the matrix in its memory-efficient (sparse) format? Or am I out of luck?
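For reference, the step that actually blows up is (roughly) the dense coercion of the DTM (using the document_term_matrix object from the sample code below); at 99,814 × 16,554 double-precision cells, that allocation alone is about 13 GB:

# rough sketch of the failing step: coercing the sparse DTM to a dense
# base-R matrix allocates 99,814 * 16,554 * 8 bytes, roughly 13 GB
dense_dtm <- as.matrix(document_term_matrix)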
Here's some sample code that should give an idea of what I am trying to do. I am using the text2vec library to do the transformation on the text.
library(text2vec)

review_matrix <- data.frame(id = c(1, 2, 3),
                            review = c('This review is negative',
                                       'This review is positive',
                                       'This review is positive'),
                            sentiment = c('Negative', 'Positive', 'Positive'),
                            stringsAsFactors = FALSE) # keep the text as character

# tokenize the reviews and build an iterator over the documents
tokens <- word_tokenizer(review_matrix$review)
iterator <- itoken(tokens,
                   ids = review_matrix$id,
                   progressbar = FALSE)

# vocabulary and (sparse) document-term matrix
vocabulary <- create_vocabulary(iterator)
vectorizer <- vocab_vectorizer(vocabulary)
document_term_matrix <- create_dtm(iterator, vectorizer)

# re-weight the raw counts with TF-IDF
model_tf_idf <- TfIdf$new()
document_term_matrix <- model_tf_idf$fit_transform(document_term_matrix)

# This is where I am trying to do the chisq.test
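For concreteness, this is the kind of computation I am after: a minimal sketch, assuming a 2x2 presence/absence-by-polarity contingency table per term, that only ever uses sparse column sums (Matrix::colSums) so the matrix is never densified. The dtm_bin and chi2 names are just mine for illustration, and it assumes the DTM rows line up with the rows of review_matrix:

library(Matrix)

# term presence/absence per document; this keeps the sparse structure
# (arguably this should be done on the raw counts, before the TF-IDF step)
dtm_bin <- document_term_matrix > 0

pos <- review_matrix$sentiment == 'Positive'

# per-term document frequencies in each class, via sparse column sums
n11 <- Matrix::colSums(dtm_bin[pos, , drop = FALSE])  # present, Positive
n10 <- Matrix::colSums(dtm_bin[!pos, , drop = FALSE]) # present, Negative
n01 <- sum(pos)  - n11                                # absent,  Positive
n00 <- sum(!pos) - n10                                # absent,  Negative
n   <- nrow(dtm_bin)

# vectorised 2x2 chi-squared statistic (no continuity correction);
# terms occurring in every document come out NaN and can be dropped
chi2 <- n * (n11 * n00 - n10 * n01)^2 /
  ((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))

head(sort(chi2, decreasing = TRUE))

But I am not sure whether this is equivalent to what chisq.test would report per term, or whether there is a more standard way to do it on a sparse matrix.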