removeWords not working

Question

I am trying to build a wordcloud of the jeopardy dataset found here: https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/

My code is as follows:

library(tm)
library(SnowballC)
library(wordcloud)

jeopQ <- read.csv('JEOPARDY_CSV.csv', stringsAsFactors = FALSE)

jeopCorpus <- Corpus(VectorSource(jeopQ$Question))
jeopCorpus <- tm_map(jeopCorpus, PlainTextDocument)
jeopCorpus <- tm_map(jeopCorpus, removePunctuation)
jeopCorpus <- tm_map(jeopCorpus, removeWords, c('the', 'this', stopwords('english')))
jeopCorpus <- tm_map(jeopCorpus, stemDocument)

wordcloud(jeopCorpus, max.words = 100, random.order = FALSE)

The words 'the' and 'this' are still appearing in the wordcloud. Why is this happening and how can I fix it?

For what it's worth, 'the' and 'this' are already in `stopwords('english')`. – Tensibai, Sep 04 '15 at 12:18

phiver · Accepted Answer · 2015-09-04T14:51:18.083

9

The problem is that you didn't convert the text to lower case. A lot of questions start with "The". The stopwords are all in lower case, e.g. "the" and "this". Since "The" != "the", "The" is not removed from the corpus.
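
The case mismatch can be reproduced without `tm`. This is a minimal base R sketch: `drop_words` is a hypothetical, simplified stand-in for whole-word removal (not the package's own implementation), and the two-word stopword list and the sample sentence are made up for illustration.

```r
# Hypothetical mini stopword list; in the thread the real list comes from
# stopwords('english').
stops <- c("the", "this")

# Simplified stand-in for word removal: delete whole-word matches.
# Matching is case-sensitive, which is the point of the demonstration.
drop_words <- function(x, words) {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  gsub(pattern, "", x, perl = TRUE)
}

question <- "The capital of this country is Paris"

as_is   <- drop_words(question, stops)           # capital "The" survives
lowered <- drop_words(tolower(question), stops)  # "the" is removed as well

print(as_is)
print(lowered)
```

Only the lower-cased version loses both words, which is why the lower-casing step has to come before the stopword removal.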

If you use the code below it should work correctly:

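# Lower-case first so that "The" matches the lower-case stopword "the"
jeopCorpus <- tm_map(jeopCorpus, content_transformer(tolower))
# Remove stopwords before punctuation so entries such as "didn't" still match
jeopCorpus <- tm_map(jeopCorpus, removeWords, stopwords('english'))
jeopCorpus <- tm_map(jeopCorpus, removePunctuation)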
jeopCorpus <- tm_map(jeopCorpus, PlainTextDocument)
jeopCorpus <- tm_map(jeopCorpus, stemDocument)

wordcloud(jeopCorpus, max.words = 100, random.order = FALSE)

edited Sep 04 '15 at 14:51

answered Sep 04 '15 at 13:39

phiver


1

I think that the line with `removePunctuation` should be placed *after* `removeWords, stopwords("en")`. If you look at `stopwords("en")`, there are many words with apostrophes, like "didn't", "we'll", "he'd" which would not be recognized if the punctuation was removed before attempting to remove these words. – RHertel Sep 04 '15 at 14:17
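
The ordering issue described in this comment can be illustrated with a small base R sketch. The three stopwords are the ones quoted in the comment; `strip_punct` and `drop_words` are hypothetical helpers that only approximate what the corresponding `tm` steps do, and the sample sentence is made up.

```r
# Stopwords with apostrophes, as quoted in the comment above.
stops <- c("didn't", "we'll", "he'd")

# Hypothetical helpers approximating the two processing steps.
strip_punct <- function(x) gsub("[[:punct:]]", "", x)
drop_words <- function(x, words) {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  gsub(pattern, "", x, perl = TRUE)
}

text <- "he didn't answer"

# Punctuation first: "didn't" becomes "didnt" and no longer matches the list.
punct_first <- drop_words(strip_punct(text), stops)

# Stopwords first: "didn't" is removed before the apostrophe is stripped.
words_first <- strip_punct(drop_words(text, stops))

print(punct_first)
print(words_first)
```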

score 0 · Answer 2 · edited May 23 '17 at 12:30

0

The construction of the argument does not seem right: see here and here

tm_map(jeopCorpus, removeWords, c(stopwords("english"),"the","this"))

But as said, those words are already included, so simply

tm_map(jeopCorpus, removeWords, stopwords("english"))

should work

edited May 23 '17 at 12:30

Community


answered Sep 04 '15 at 12:24

PereG


I have already tried both of those and it didn't work. Do you think it has something to do with the version? When I type out stopwords() to see all the stop words, I don't see 'the' and 'this' in the list. What am I missing? – ytk Sep 04 '15 at 13:03
1

I think that the answer of @phiver leads to the right solution. I believe it still contains a little mistake concerning the placement of `removePunctuation` that I mentioned in a comment, but apart from that it should work. The essential point is that you must transform the text to lower case using `content_transformer(tolower)` *before* removing the stopwords. The words "the" and "this" are in the list of `stopwords("en")`, but only in lower case, and not with a capital "T" as they may occur at the beginning of sentences. – RHertel Sep 04 '15 at 14:20
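
One way to check this claim directly is to test membership in the stopword list for both spellings. This is a small sketch, assuming the `tm` package is installed; the `targets` vector is made up for illustration, and the check is skipped if the package is missing.

```r
# Spellings to look up: lower case and capitalised.
targets <- c("the", "this", "The", "This")

if (requireNamespace("tm", quietly = TRUE)) {
  sw <- tm::stopwords("english")
  # Named logical vector: is each spelling in the list exactly as written?
  found <- setNames(targets %in% sw, targets)
  print(found)
} else {
  message("Package 'tm' is not installed; skipping the check.")
}
```

Per the comments in this thread, the lower-case spellings should be reported as present and the capitalised ones as absent.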
