I am trying to create a wordcloud for each text file in a directory. The files are four presidential announcement speeches. I keep getting warning messages; the full console transcript is below:

> cname <- file.path("C:", "texts")
> cname
[1] "C:/texts"

> cname <- file.path("C:\\Users\\BonitaW\\Documents\\DATA630\\texts")
> dir(cname)
[1] "berniesandersspeechtranscript20115.txt"
[2] "hillaryclintonspeechtranscript2015.txt"
[3] "jebbushspeechtranscript2015.txt"       
[4] "randpaulspeechtranscript2015.txt"      
> library(tm)
> docs <- Corpus(DirSource(cname)) 
> summary (docs)
                                       Length Class             Mode
berniesandersspeechtranscript20115.txt 2      PlainTextDocument list
hillaryclintonspeechtranscript2015.txt 2      PlainTextDocument list
jebbushspeechtranscript2015.txt        2      PlainTextDocument list
randpaulspeechtranscript2015.txt       2      PlainTextDocument list
> docs <- tm_map(docs, removePunctuation) 
> docs <- tm_map(docs, removeNumbers)
> docs <- tm_map(docs, removeWords, stopwords("english"))
> library(SnowballC) 
Warning message:
package ‘SnowballC’ was built under R version 3.1.3 
> docs <- tm_map(docs, stemDocument)
> docs <- tm_map(docs, stripWhitespace) 
> docs <- tm_map(docs, PlainTextDocument)
> dtm <- DocumentTermMatrix(docs)
> dtm
<<DocumentTermMatrix (documents: 4, terms: 1887)>>
Non-/sparse entries: 2862/4686
Sparsity           : 62%
Maximal term length: 20
Weighting          : term frequency (tf)
> tdm <- TermDocumentMatrix(docs) 
> tdm
<<TermDocumentMatrix (terms: 1887, documents: 4)>>
Non-/sparse entries: 2862/4686
Sparsity           : 62%
Maximal term length: 20
Weighting          : term frequency (tf)

> library(wordcloud)
> Berniedoc <- wordcloud(names(freq), freq, min.freq=25)   
Warning message:
In wordcloud(names(freq), freq, min.freq = 25) :
american could not be fit on page. It will not be plotted.

Initially I was able to plot Berniedoc, but I lost the graphic and now it will not plot:

> Berniedoc <- wordcloud(names(freq), freq, min.freq=25)
Warning messages:
1: In wordcloud(names(freq), freq, min.freq = 25) :
american could not be fit on page. It will not be plotted.
2: In wordcloud(names(freq), freq, min.freq = 25) :
 work could not be fit on page. It will not be plotted.
3: In wordcloud(names(freq), freq, min.freq = 25) :
countri could not be fit on page. It will not be plotted.
4: In wordcloud(names(freq), freq, min.freq = 25) :
year could not be fit on page. It will not be plotted.
5: In wordcloud(names(freq), freq, min.freq = 25) :
new could not be fit on page. It will not be plotted.
6: In wordcloud(names(freq), freq, min.freq = 25) :
see could not be fit on page. It will not be plotted.
7: In wordcloud(names(freq), freq, min.freq = 25) :
and could not be fit on page. It will not be plotted.
8: In wordcloud(names(freq), freq, min.freq = 25) :
can could not be fit on page. It will not be plotted.
9: In wordcloud(names(freq), freq, min.freq = 25) :
time could not be fit on page. It will not be plotted.

Could you please tell me what I am doing wrong? Could it be the scaling? Or should I change 'Berniedoc' to something else?

  • Well, those aren't errors, those are warnings telling you that a particular word appears so much more often that it's too large to print in the plotting device. It's possible you are trying to print too many words. It would help to have a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). None of the file reading seems to matter for your problem; it appears to be exclusive to the plotting. Maybe try `max.words=50` to do just the top 50. – MrFlick May 11 '15 at 05:33
  • In addition to MrFlick's suggestion to limit the number of terms, you can vary the size of the words with the scale argument to wordcloud. For example, try scale=c(.5,1) to squeeze more but smaller terms on the plot. – lawyeR May 11 '15 at 10:08
  • Where are you defining the variable `freq`? – seaotternerd May 12 '15 at 06:27

3 Answers


How about an alternative approach, using the quanteda package?

You will need to change the directory references for your own files, of course. Setting the size of the pdf device should make the "could not be fit on page" warnings go away.

require(quanteda)

# load the files into a quanteda corpus
myCorpus <- corpus(textfile("~/Dropbox/QUANTESS/corpora/inaugural/*.txt"))
ndoc(myCorpus)
## [1] 57

# create a document-feature matrix, removing stopwords
myDfm <- dfm(myCorpus, remove = stopwords("english"))
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing 57 documents
## ... shaping tokens into data.table, found 134,024 total tokens
## ... ignoring 174 feature types, discarding 69,098 total features (51.6%)
## ... summing tokens by document
## ... indexing 8,958 feature types
## ... building sparse matrix
## ... created a 57 x 8958 sparse dfm
## ... complete. Elapsed time: 0.256 seconds.

# just do first four
for (i in 1:4) {
    pdf(file = paste0("~/tmp/", docnames(myCorpus)[i], ".pdf"), height=12, width=12)
    textplot_wordcloud(myDfm[i, ])  # pass through any arguments you wish to wordcloud()
    dev.off()
}
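
If you would rather stay with the tm/wordcloud pipeline from your question, the same device-sizing idea applies there. A minimal sketch, assuming `freq` is the named frequency vector from your original `wordcloud()` call (the question never shows how it was built) and using a hypothetical output file name:

library(wordcloud)

# open a large pdf device so the biggest words have room, plot, then close it
pdf("bernie_wordcloud.pdf", width = 12, height = 12)
wordcloud(names(freq), freq, min.freq = 25)
dev.off()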
– Ken Benoit

You should add a "max.words" argument to restrict the number of words plotted.

Berniedoc <- wordcloud(names(freq), freq, min.freq=25, max.words = 50)
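
Since the question never shows where `freq` comes from, here is a minimal sketch of one common way to build it from the `tdm` term-document matrix in the question and then cap the cloud at the 50 most frequent terms:

library(tm)
library(wordcloud)

# total frequency of each term across all four speeches
m <- as.matrix(tdm)
freq <- sort(rowSums(m), decreasing = TRUE)

# keep only the 50 most frequent terms so everything fits on the device
wordcloud(names(freq), freq, min.freq = 25, max.words = 50)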

I think this would be simpler with a reproducible example. I have no idea what "C:\\Users\\BonitaW\\Documents\\DATA630\\texts" contains, but I can tell you that I just solved a quite similar problem.

All you need to do is play with the scale parameter of wordcloud, in particular its first number: scale is a length-2 vector giving the range of word sizes (largest first), not a single size, so lowering the first value lets the biggest words fit on the device.
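
A minimal sketch, assuming the same `freq` vector as in the question (wordcloud's default is scale = c(4, 0.5)):

# smaller upper bound on word size than the default c(4, 0.5),
# so the most frequent words no longer overflow the plotting device
wordcloud(names(freq), freq, min.freq = 25, scale = c(3, 0.5))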

– elcortegano