I am doing topic modelling on a database of downloaded tweets using the topicmodels package in R. I prepare a corpus from the original text of the tweets with quanteda, turn it into a dfm object, and then run the dfm through the LDA function. However, with the dataset I am using, the dfm ends up with a few all-zero rows (documents in which no tokens survive preprocessing), so I'm forced to run dfm_subset before the LDA will accept it:
library(quanteda)
library(tm)  # for removeWords() and stopwords()

corpus_tweets <- corpus(mydata$text)
corpus_tweets <- iconv(corpus_tweets, "UTF-8", "latin1", sub = "") # get rid of emojis, end-of-line characters
corpus_tweets <- gsub("#\\w+", "", corpus_tweets)       # get rid of hashtags
corpus_tweets <- gsub("[[:punct:]]", "", corpus_tweets) # get rid of punctuation
corpus_tweets <- gsub("[[:digit:]]", "", corpus_tweets) # get rid of numbers
corpus_tweets <- gsub("^\\s+|\\s+$", "", corpus_tweets) # trim leading/trailing whitespace
corpus_tweets <- tolower(corpus_tweets)
corpus_tweets <- removeWords(corpus_tweets, stopwords("spanish"))

edsdfm <- tokens(corpus_tweets, remove_punct = TRUE, remove_numbers = TRUE,
                 remove_url = TRUE, remove_symbols = TRUE) %>%
  tokens_ngrams(n = 1:2) %>%
  dfm()

edsdfm <- dfm_subset(edsdfm, ntoken(edsdfm) > 0)
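For reference, the modelling step itself looks roughly like this (a sketch: k = 10 and the seed are placeholder values, and convert() turns the dfm into the DocumentTermMatrix format that topicmodels expects):

library(topicmodels)

# convert the quanteda dfm to the format topicmodels expects, then fit
edsdtm <- convert(edsdfm, to = "topicmodels")
edslda <- LDA(edsdtm, k = 10, control = list(seed = 1234))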
This workaround doesn't get me far, though, since I then run into the issue noted on this thread by Dario Lacan: I can no longer categorise my original tweets by the results of the LDA analysis, because the rows of the subsetted matrix no longer correspond to the rows of the original dataframe.
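To make the mismatch concrete (a sketch, using the edslda model fitted above):

# topics() returns one topic per document remaining in the dfm, so its
# length is smaller than nrow(mydata) once the empty rows are dropped
doc_topics <- topics(edslda)
length(doc_topics)          # fewer entries than nrow(mydata)
mydata$topic <- doc_topics  # fails: replacement has fewer rows than data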
I could adopt one of the solutions suggested in that thread instead, but none of them work for me, since they all hinge on this code:
rowTotals <- apply(dtm, 1, sum)  # find the sum of words in each document
dtm.new <- dtm[rowTotals > 0, ]  # remove all docs without words
But whenever I try to run it, R returns the following error:
Error: cannot allocate vector of size 14.9 Gb
This is possibly due to the size of the database I'm working with (over 25,000 tweets): apply() seems to coerce the sparse document-term matrix to a dense one, and with unigram and bigram features that dense matrix simply doesn't fit in memory. I have been stuck here for a whole day and I'm running out of ideas on how to detect which rows contain only zeros and delete them from my original database as well.
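What I'm after is something along these lines, computing the index on the dfm before the dfm_subset step (a sketch; it assumes the row order of the unsubsetted dfm still matches mydata, and relies on quanteda's ntoken(), which sums each row of the sparse matrix without densifying it):

keep <- ntoken(edsdfm) > 0      # sparse-safe; no dense copy is made

mydata_clean <- mydata[keep, ]  # drop the same rows from the original data
edsdfm <- edsdfm[keep, ]        # ...and from the dfm, keeping both aligned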