I am trying to create a word cloud in R, but my dataset is too big (it contains 500,000 tweets). I always get the following error message when running the line

m <- as.matrix(tdm)

"Error: cannot allocate vector of size 539.7 Gb"

Is there a more efficient way to create a word cloud in R?

Here is my code so far:

library(tm)            # Corpus, tm_map, TermDocumentMatrix
library(wordcloud)     # wordcloud
library(RColorBrewer)  # brewer.pal

corpus <- Corpus(VectorSource(response$Tweet))
## cleaning data
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removePunctuation)
tdm <- TermDocumentMatrix(corpus)
m <- as.matrix(tdm)  # fails here: "cannot allocate vector of size 539.7 Gb"
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
wordcloud(d$word, d$freq, random.order = FALSE, min.freq = 12, rot.per = 0.35,
          max.words = 150, colors = brewer.pal(8, "Dark2"))
stefan
  • How big is your dataframe `d`? Have you tried decreasing the size of your dataframe (e.g., `head(d, n = 150)`), instead of using the `max.words` argument? – jrcalabrese Jan 08 '23 at 14:32
  • @jrcalabrese In my case I don't get to the point where `d` is created. It fails when creating `m` from my TermDocumentMatrix. My corpus has 482794 elements and takes 85.4 MB. I'm not 100% sure, but I guess this step is vital, because that is where the matrix with the word frequencies is created, and the result wouldn't be the same if I only worked with the head of the data. – stefan Jan 08 '23 at 15:54
  • Ah ok, then it looks like you may have to use an additional package (probably `slam` or `Matrix`); [this person](https://stackoverflow.com/questions/50890935/why-does-as-matrix-result-in-memory-overload-while-running-text-mining-in-r) and [this person](https://stackoverflow.com/questions/66573833/error-cannot-allocate-vector-of-size-38-3-gb-while-creating-a-document-term-mat) had the same issue. – jrcalabrese Jan 08 '23 at 16:11
  • 1
    @jrcalabrese thank you very much you showed me the right direction. I somehow had issues because Matrix somehow was used and I wasn't succeeding with slam but by continuing to search for an answer with this two packages I came to my main issue with the sparsity. I just needed to add tdms <- removeSparseTerms(tdm, 0.99) to reduce the sparsity and I was able to create the word cloud – stefan Jan 08 '23 at 18:50

1 Answer

Before creating the matrix, it is necessary to reduce the sparsity of the term-document matrix. After that, converting it with `as.matrix()` no longer needs that much RAM: `tdms <- removeSparseTerms(tdm, 0.99)`
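For illustration, here is a minimal sketch of that step on a tiny made-up corpus. The `tweets` vector is a hypothetical stand-in for `response$Tweet`, and the cleaning steps are left out for brevity. It also sums the term counts with `slam::row_sums()`, following the `slam` suggestion from the comments, so the dense matrix from `as.matrix()` is never built. Note that a threshold of 0.99 drops every term that is missing from more than 99% of the documents, so rarer words will not reach the word cloud.

```r
library(tm)    # Corpus(), TermDocumentMatrix(), removeSparseTerms(), Terms()
library(slam)  # row_sums() for sparse (simple triplet) matrices

# Toy stand-in for response$Tweet (made-up data, for illustration only)
tweets <- c("rain rain today", "rain tomorrow", "sunny today")

corpus <- Corpus(VectorSource(tweets))
tdm <- TermDocumentMatrix(corpus)

# Drop terms that are absent from more than 99% of the documents.
# On this toy corpus nothing is removed; on a large corpus this can
# shrink the vocabulary considerably.
tdms <- removeSparseTerms(tdm, 0.99)

# Sum each term's counts directly on the sparse matrix, so the dense
# terms-by-documents matrix is never allocated.
v <- sort(setNames(slam::row_sums(tdms), Terms(tdms)), decreasing = TRUE)
d <- data.frame(word = names(v), freq = unname(v))
print(d)
```

The resulting `d` has the same `word` and `freq` columns as in the question, so it should be usable in the original `wordcloud(d$word, d$freq, ...)` call unchanged.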

stefan