
I have a matrix with 99,814 items containing reviews and their respective polarities (positive or negative), and I want to do some feature selection over the terms of the corpus, so that only the terms most discriminative for each score are passed to a model.

The problem is that I am currently working with 16,554 terms, so trying to convert the document-term matrix out of its sparse representation so I can apply something like chi-squared to the terms gives me a "Cholmod error out of memory" message.

So my question is: is there any feasible way to get the chi-squared value of every term while the matrix stays in its memory-efficient sparse format? Or am I out of luck?

Here's some sample code that should give an idea of what I am trying to do. I am using the text2vec library for the transformation of the text.

library(text2vec)

review_matrix <- data.frame(id=c(1,2,3),
                            review=c('This review is negative',
                                     'This review is positive',
                                     'This review is positive'),
                            sentiment=c('Negative', 'Positive', 'Positive'))


tokenizer <- word_tokenizer
tokens <- tokenizer(review_matrix$review)

iterator <- itoken(tokens, 
                   ids = review_matrix$id, 
                   progressbar = FALSE)

vocabulary <- create_vocabulary(iterator)
vectorizer <- vocab_vectorizer(vocabulary)
document_term_matrix <- create_dtm(iterator, vectorizer)
model_tf_idf <- TfIdf$new()
document_term_matrix <- model_tf_idf$fit_transform(document_term_matrix)

# This is where I am trying to do the chisq.test
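One possible direction, sketched below under assumptions not in the question: for a two-class problem, the chi-squared statistic of each term can be computed from a 2x2 contingency table (term present/absent vs. class), and those tables can be built from column sums of the sparse matrix without ever densifying it. The function name `chi_squared_terms` and the presence/absence view of the DTM are my own choices, not part of the original code.

```r
library(Matrix)

# Hypothetical sketch: per-term chi-squared scores computed directly on a
# sparse dgCMatrix (as returned by text2vec::create_dtm), assuming a binary
# presence/absence view of each term and a two-level sentiment label.
chi_squared_terms <- function(dtm, labels) {
  labels <- as.factor(labels)
  stopifnot(nlevels(labels) == 2)
  presence <- dtm > 0                     # sparse logical matrix, stays sparse
  is_pos   <- labels == levels(labels)[2]
  n        <- nrow(dtm)

  # 2x2 table entries, vectorised over all terms at once
  present_pos <- Matrix::colSums(presence[is_pos,  , drop = FALSE])
  present_neg <- Matrix::colSums(presence[!is_pos, , drop = FALSE])
  absent_pos  <- sum(is_pos)  - present_pos
  absent_neg  <- sum(!is_pos) - present_neg

  # standard chi-squared statistic for a 2x2 table:
  # n * (ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d))
  num <- n * (present_pos * absent_neg - present_neg * absent_pos)^2
  den <- (present_pos + present_neg) * (absent_pos + absent_neg) *
         (present_pos + absent_pos)  * (present_neg + absent_neg)
  chi2 <- ifelse(den == 0, 0, num / den)
  sort(chi2, decreasing = TRUE)
}
```

Since only `colSums` and row subsetting are used, the matrix never leaves its sparse representation, so memory stays proportional to the number of nonzero entries. The top-scoring terms could then be used to subset the columns of the DTM before fitting the model.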
  • Does [this SO answer](https://stackoverflow.com/a/38570220/4985176) help you? – phiver Dec 22 '20 at 17:27
  • The one I tried was actually quite similar to the first one on that page, but it turns the DTM into a matrix and it blows up my memory as well. So no luck there. But I haven't tried the one with textstat_keyness(). I will see if it works. – Matheus Correia Dec 22 '20 at 19:07
