
I am trying to remove stopwords from a text saved in a text file (size 500 KB). Even after running the removal instruction multiple times, and after every other operation (removing punctuation, numbers, etc.), I still see stopwords in the word cloud. Has anyone experienced the same issue? Is there a fix, or am I doing something wrong? Please advise. Here is the code:

library(tm)
library(wordcloud)
lords <- Corpus(DirSource('searsoutlet/'))

lords <- tm_map(lords, removeWords, stopwords('english'))
lords <- tm_map(lords, content_transformer(tolower)) 
lords <- tm_map(lords, removeWords, stopwords('english'))
#wordcloud(lords, scale=c(4,0.5), max.words=100, random.order=1, rot.per=0.35, use.r.layout=FALSE, colors=brewer.pal(10, 'Dark2'))

lords <- tm_map(lords, stripWhitespace)
lords <- tm_map(lords, removeWords, stopwords('english'))
lords <- tm_map(lords, removePunctuation)
lords <- tm_map(lords, removeWords, stopwords('english'))
lords <- tm_map(lords, removeNumbers)
lords <- tm_map(lords, removeWords, stopwords('english'))
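For comparison, a minimal sketch of the same pipeline with the ordering suggested in the comments — lowercase first (tm's built-in stopword list is all lowercase), then strip punctuation, numbers, and whitespace, and remove stopwords once at the end. It assumes the same `searsoutlet/` directory as above:

```r
library(tm)
library(wordcloud)

lords <- Corpus(DirSource('searsoutlet/'))

# Lowercase first so tokens actually match tm's all-lowercase stopword list
lords <- tm_map(lords, content_transformer(tolower))
lords <- tm_map(lords, removePunctuation)
lords <- tm_map(lords, removeNumbers)
lords <- tm_map(lords, stripWhitespace)

# Remove stopwords once, after all other cleanup
lords <- tm_map(lords, removeWords, stopwords('english'))
```

Note that `stopwords('english')` is a fairly short list, so words outside it (e.g. "please", "able") will still appear in the cloud unless you supply your own character vector to `removeWords`.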
Ganesh K
  • Could you post a sample of the corpus where the problem occurs? http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – scoa Jul 24 '15 at 20:08
  • An example would be nice to see what is going on. But I would use the stopwords only after all the other cleanup you are doing. The stopword list is in lowercase, so first do tolower, remove white spaces, punctuation, numbers, then do the stopwords. – phiver Jul 25 '15 at 08:45
  • @phiver, thanks for helping me here. I have seen the stopwords list here: http://jmlr.csail.mit.edu/papers/volume5/lewis04a/a11-smart-stop-list/english.stop That's where I understood that this part is not working. It continues to show words like please, may, able, etc. in larger size even after following the sequence suggested, which is why I tried to remove stopwords after each step. Stemming the document made the results even worse... – Ganesh K Jul 27 '15 at 13:39
  • @scoa, thanks for your reply. Maybe I need some help. I am under the impression that the Corpus will get created automatically and I just need to point R to the doc I want it to analyse. Will it help if I share that file via email? – Ganesh K Jul 27 '15 at 13:42
  • just post the output of `dput(head(lords,1))`. This will give us the text of the first document of your corpus – scoa Jul 27 '15 at 14:01
  • if you type `sort(stopwords("english"))` you will see that this list is different from the list you look at in the link you provided. You can create your own list of stopwords and use this in the stopwords function. – phiver Jul 27 '15 at 14:35
  • @scoa I am processing a 500 KB file.. it's a big document and the command you suggested displayed the entire data. I am listing the first few sentences and the tail, which shows some metadata; hope that helps: "Ok", "They will get back to you within 24 to 48 hours by email.", "Really?", "I suppose there's nothing else you can do for me now?", "I want to u see this app as I'm travelling to russia", "Can't you give me a download version?", – Ganesh K Jul 27 '15 at 14:54
  • @scoa had to edit it because of the number of characters.. here is what I get in tail: meta = structure(list(author = character(0), datetimestamp = structure(list( sec = 0.691083908081055, min = 52L, hour = 14L, mday = 27L, mon = 6L, year = 115L, wday = 1L, yday = 207L, isdst = 0L), .Names = c("sec", "min", "hour", "mday", "mon", "year", "wday", "yday", "isdst" ), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), description = character(0), – Ganesh K Jul 27 '15 at 14:56
  • heading = character(0), id = "test.txt", language = "en", origin = character(0)), .Names = c("author", "datetimestamp", "description", "heading", "id", "language", "origin"), class = "TextDocumentMeta")), .Names = c("content", "meta"), class = c("PlainTextDocument", "TextDocument"))), meta = structure(list(), class = "CorpusMeta"), dmeta = structure(list(), .Names = character(0), row.names = 1L, class = "data.frame")), .Names = c("content", "meta", "dmeta"), class = c("VCorpus", "Corpus")) – Ganesh K Jul 27 '15 at 14:56
  • @phiver: That is not good! You are correct. The list which that command gave consists of only 174 words, as compared with 571 on that page, and none of the words I stated above are covered. Maybe I need to create a list and then use it as stopwords. Will it work if I save the list on that page in a CSV or TXT file, read it into a variable and use it to process the corpus? – Ganesh K Jul 27 '15 at 15:03
  • @phiver, that really was helpful. I found the stopwords list in R and edited it. Executed the same command and it did the trick!! Thanks for your help. – Ganesh K Jul 27 '15 at 15:37
  • You're welcome. Be aware that if there is an update of the tm package you will lose your edit of the stopwords list. Save them in a separate location, or create a vector with the following code: `new_stopwords <- readLines("http://jmlr.csail.mit.edu/papers/volume5/lewis04a/a11-smart-stop-list/english.stop")`. You can then use `lords <- tm_map(lords, removeWords, new_stopwords)` – phiver Jul 27 '15 at 16:04
  • @phiver .. Sure.. thanks again. – Ganesh K Jul 27 '15 at 16:42

0 Answers