Memory-efficient word cloud? Huge dataset causes a "cannot allocate vector" error

Question

I am trying to make a word cloud in R, but my dataset is very large (500,000 tweets). I always get the following error message when running this line:

m <- as.matrix(tdm)

"Error: cannot allocate vector of size 539.7 Gb"

Is there a more memory-efficient way to create a word cloud in R?

Here is my code so far:

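library(tm)            # Corpus, tm_map, TermDocumentMatrix
library(wordcloud)     # wordcloud
library(RColorBrewer)  # brewer.pal

corpus <- Corpus(VectorSource(response$Tweet))

## Clean the data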
corpus <- tm_map(corpus,content_transformer(tolower))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removePunctuation)

tdm <- TermDocumentMatrix(corpus)
m <- as.matrix(tdm)
v <- sort(rowSums(m),decreasing = TRUE)
d <- data.frame(word = names(v), freq=v)

wordcloud(d$word, d$freq, random.order=FALSE, min.freq = 12, rot.per=0.35,max.words = 150,colors = brewer.pal(8, "Dark2"))

How big is your dataframe `d`? Have you tried decreasing the size of your dataframe (e.g., `head(d, n = 150)`), instead of using the `max.words` argument? — jrcalabrese, Jan 08 '23 at 14:32
@jrcalabrese In my case, I don't get to the point where `d` is created. It fails when creating `m` from my TermDocumentMatrix. My corpus has 482794 elements and is 85.4 MB. I'm not 100% sure, but I think this step is vital, because that is where the matrix of word frequencies is created, and the result wouldn't be the same if I only worked with the head of the data. — stefan, Jan 08 '23 at 15:54
Ah ok, then it looks like you may have to use an additional package (probably `slam` or `Matrix`); [this person](https://stackoverflow.com/questions/50890935/why-does-as-matrix-result-in-memory-overload-while-running-text-mining-in-r) and [this person](https://stackoverflow.com/questions/66573833/error-cannot-allocate-vector-of-size-38-3-gb-while-creating-a-document-term-mat) had the same issue. — jrcalabrese, Jan 08 '23 at 16:11
@jrcalabrese Thank you very much, you pointed me in the right direction. I had issues because Matrix was somehow being used, and I wasn't succeeding with slam, but by continuing to search for an answer with these two packages I found my main issue: the sparsity. I just needed to add `tdms <- removeSparseTerms(tdm, 0.99)` to reduce the sparsity, and then I was able to create the word cloud. — stefan, Jan 08 '23 at 18:50

score 1 · Answer 1 · answered Jan 08 '23 at 18:53

Before creating the matrix, you need to reduce the sparsity of the term-document matrix. After that, far less RAM is needed to create the matrix:

tdms <- removeSparseTerms(tdm, 0.99)
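To illustrate, here is a minimal sketch on a tiny, made-up set of tweets (the `tweets` vector and the `0.6` threshold are assumptions chosen so the toy example behaves visibly; they are not from the question). It shows two options: trimming rare terms with `removeSparseTerms()` before calling `as.matrix()`, and, as an alternative along the lines of the `slam` suggestion in the comments, summing the rows of the sparse matrix directly with `slam::row_sums()` so that no dense matrix is built at all. As far as I understand `removeSparseTerms()`, a threshold of `0.99` keeps only terms that occur in more than roughly 1% of the documents, so very rare words will no longer be counted.

```r
library(tm)    # Corpus, TermDocumentMatrix, removeSparseTerms
library(slam)  # row_sums() for sparse (simple triplet) matrices

# Hypothetical stand-in for response$Tweet
tweets <- c("the cat sat on the mat",
            "the dog sat on the log",
            "a cat and a dog",
            "rare zebra sighting")

corpus <- Corpus(VectorSource(tweets))
# (the cleaning steps from the question would go here)
tdm <- TermDocumentMatrix(corpus)

# Option 1: drop rare terms first, then densify the much smaller matrix.
# 0.6 is only for this toy data; the answer uses 0.99 on the real corpus.
tdms <- removeSparseTerms(tdm, 0.6)
m  <- as.matrix(tdms)
v1 <- sort(rowSums(m), decreasing = TRUE)

# Option 2: sum term counts on the sparse matrix, never calling as.matrix().
v2 <- sort(row_sums(tdm), decreasing = TRUE)
d  <- data.frame(word = names(v2), freq = v2)
```

The resulting `d` can be passed to `wordcloud(d$word, d$freq, ...)` exactly as in the question. Option 2 keeps every term, so `min.freq` and `max.words` still do the filtering; Option 1 changes which terms are available, which may or may not matter for a cloud limited to 150 words.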

answered Jan 08 '23 at 18:53

stefan
