
I am doing topic modelling on a database of downloaded tweets, using the topicmodels package in R. I prepare a corpus from the original text of the tweets, turn it into a dfm object, and then run the dfm through the LDA function.

However, with the dataset I am using, the dfm ends up with a few rows in which every value is zero (documents left with no tokens), so I'm forced to run dfm_subset before I can run the LDA:

library(quanteda)
library(tm) # for removeWords() and stopwords()

corpus_tweets <- corpus(mydata$text)
# the string functions below coerce the corpus to a plain character vector
corpus_tweets <- iconv(corpus_tweets, "UTF-8", "latin1", sub = "") # get rid of emojis, end-of-line characters
corpus_tweets <- gsub("#\\w+", "", corpus_tweets)       # get rid of hashtags
corpus_tweets <- gsub("[[:punct:]]", "", corpus_tweets) # get rid of punctuation
corpus_tweets <- gsub("[[:digit:]]", "", corpus_tweets) # get rid of numbers
corpus_tweets <- gsub("^\\s+|\\s+$", "", corpus_tweets) # trim leading/trailing whitespace
corpus_tweets <- tolower(corpus_tweets)
corpus_tweets <- removeWords(corpus_tweets, stopwords("spanish"))
    
edsdfm <- tokens(corpus_tweets, remove_punct = TRUE, remove_numbers = TRUE,
                 remove_url = TRUE, remove_symbols = TRUE) %>%
  tokens_ngrams(n = 1:2) %>%
  dfm()

edsdfm <- dfm_subset(edsdfm, ntoken(edsdfm) > 0)
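
For reference, this is roughly how I then run the LDA on the trimmed dfm (a sketch: k = 10 is just a placeholder, and I convert the dfm first because topicmodels' LDA() doesn't accept a dfm directly):

library(topicmodels)

lda_out <- LDA(convert(edsdfm, to = "topicmodels"), k = 10)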

This workaround doesn't get me far, though, since I run into the issue noted in this thread by Dario Lacan: I can no longer match the results of the LDA analysis back to my original tweets, because the subsetted matrix no longer corresponds row-for-row to the original dataframe.
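
My first thought was to keep the logical index from the subsetting step so I can map everything back to mydata afterwards. A rough, untested sketch (assuming dfm_subset keeps the remaining documents in their original order):

keep        <- ntoken(edsdfm) > 0       # one entry per original tweet
edsdfm      <- dfm_subset(edsdfm, keep) # same trimming as above
mydata_kept <- mydata[keep, ]           # rows now line up with the dfm
# after fitting: mydata_kept$topic <- topics(lda_out)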

Instead, I could use one of the solutions suggested in that thread, but none of them work for me, since they all hinge on this code:

rowTotals <- apply(dtm, 1, sum)   # find the sum of words in each document
dtm.new   <- dtm[rowTotals > 0, ] # remove all docs without words

But whenever I try to run it, R returns the following error:

Error: cannot allocate vector of size 14.9 Gb

This is probably because apply() expands the sparse dtm into a dense matrix, which doesn't fit in memory given the size of the dataset I'm working with (over 25,000 tweets, with unigrams and bigrams). I have been stuck here for a whole day and I'm running short of ideas on how to detect which rows contain no non-zero values and delete them from my original database.
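
One idea I haven't been able to verify yet: compute the row totals with a sparse-aware function instead of apply(), so the dtm never gets expanded into a dense matrix. A sketch, assuming dtm is a tm-style DocumentTermMatrix (a slam simple_triplet_matrix under the hood):

library(slam)

rowTotals <- row_sums(dtm)        # row sums without densifying the matrix
dtm.new   <- dtm[rowTotals > 0, ] # same subsetting as before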

  • It looks like you might be using an older version of quanteda. Also you should process the removals you want during or after tokenisation. The way to detect documents that have no tokens is `zerofreq <- featfreq(edsdfm) == 0`. – Ken Benoit Sep 22 '21 at 14:28
  • I have updated the quanteda package on my machine. Interestingly enough, `zerofreq <- featfreq(edsdfm) == 0` returns a vector with FALSE in every position. I don't really understand why, as I can successfully run the LDA with the trimmed dfm (there are only about 80 documents with no tokens). – Javier Sacristán Sep 23 '21 at 08:36
