
Hey, I need help removing words from my results gained through a Twitter search. Here is the code I use:

library("twitteR")
library("ROAuth")
cred$handshake()
save(cred, file="twitter.Rdata")
load("twitter.Rdata")
registerTwitterOAuth(cred)
tweets = searchTwitter('#apple', n = 100, lang = "en")
tweets.df = twListToDF(tweets)
names(tweets.df)
tweets.df$text
tweet.words = strsplit(tweets.df$text, "[^A-Za-z]+")
word.table = table(unlist(tweet.words))
library("tm")
myStopwords <- c(stopwords('english'), "#apple","http://")
tweet.corpus = Corpus(VectorSource(tweets.df$text))
tweet.corpus = tm_map(tweet.corpus,function(x) iconv(x, to='UTF8', sub='byte'))
tweet.corpus = tm_map(tweet.corpus, PlainTextDocument)
tweet.corpus = tm_map(tweet.corpus,removeWords, myStopwords)
tweet.dtm = DocumentTermMatrix(tweet.corpus) 
tweet.matrix = inspect(tweet.dtm) 

But the problem is that it isn't removing the results that contain #apple, or the website addresses containing http://, from the corpus. How can I remove these results? Thank you for the help, Matt.

  • You really should take the time to create a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Since the problem is really with the `tm` commands, including all the `twitteR` code that we can't run (we don't have the proper credentials) is not helpful. You should include sample data that others can copy/paste to reproduce the problem. – MrFlick Oct 17 '14 at 03:57

2 Answers


The problem is that removeWords really wants to remove "words", not symbols. It actually works via a regular expression, like this:

function (x, words) 
gsub(sprintf("(*UCP)\\b(%s)\\b", paste(words, collapse = "|")), 
    "", x, perl = TRUE)

So it takes the vector of words, collapses them via the regular expression | (or) operator, and then removes those terms. Note that it wraps the matching expressions in \b, which matches a "word boundary": a zero-length match between a "word character" and a "non-word character". The problem with your terms is that # and / qualify as non-word characters; therefore the boundaries never match and those terms are never replaced.
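
You can see this directly with a quick gsub test that mimics the pattern removeWords builds (plain base R, nothing else assumed):

# "apple" is flanked by word boundaries ('#'->'a' and 'e'->end of string),
# so it matches and is removed
gsub("(*UCP)\\b(apple)\\b", "", "hashtag #apple", perl = TRUE)
# [1] "hashtag #"

# "#apple" is not: the position before '#' lies between two non-word
# characters (' ' and '#'), so there is no \b there and nothing matches
gsub("(*UCP)\\b(#apple)\\b", "", "hashtag #apple", perl = TRUE)
# [1] "hashtag #apple"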

If you have to remove crazy symbols, you're probably better off writing your own content transformer, where you can be more explicit about the matching conditions. For example:

myremove <- content_transformer(function(x, ...) {
    # target the exact strings: "#apple" as a whole token and the
    # "http://" prefix, without relying on \b around '#' or '/'
    gsub("(#apple\\b|\\bhttp://)", "", x, perl = TRUE)
})

Then you could do:

tweets <- c("test one two", "two apples", "hashtag #apple", "#apple #tree", "http://microsoft.com")

library("tm")
tweet.corpus = Corpus(VectorSource(tweets))
tweet.corpus = tm_map(tweet.corpus, content_transformer(function(x) iconv(x, to = 'UTF8', sub = 'byte')))
tweet.corpus = tm_map(tweet.corpus, removeWords, stopwords('english'))
tweet.corpus = tm_map(tweet.corpus, myremove)
tweet.dtm = DocumentTermMatrix(tweet.corpus)
inspect(tweet.dtm)

# <<DocumentTermMatrix (documents: 5, terms: 7)>>
# Non-/sparse entries: 8/27
# Sparsity           : 77%
# Maximal term length: 13
# Weighting          : term frequency (tf)
# 
#     Terms
# Docs #tree apples hashtag microsoft.com one test two
#    1     0      0       0             0   1    1   1
#    2     0      1       0             0   0    0   1
#    3     0      0       1             0   0    0   0
#    4     1      0       0             0   0    0   0
#    5     0      0       0             1   0    0   0

Thus we just add our additional transformation step, and we can see that those terms are removed from the document-term matrix.

MrFlick

A slightly different approach, using qdap, that changes when the #apple/URL removal occurs and how the Corpus is built:

library(qdap); library(tm)

dat <- data.frame(
    person = paste0("person_", 1:5),
    tweets = c("test one two", "two apples","hashtag #apple", 
        "#apple #tree", "http://microsoft.com")
)

## remove specialty items (URLs via the @rm_url pattern, plus "#apple") up front
dat[["tweets"]] <- rm_default(dat[["tweets"]], pattern = pastex("@rm_url", "#apple\\b"))

## build the corpus from the data frame: tweets as text, person as document names
myCorp <- with(dat, as.Corpus(tweets, person))

myCorp <- tm_map(myCorp, removeWords, stopwords("english"))
myCorp %>% as.dtm() %>% tm::inspect()

## <<DocumentTermMatrix (documents: 5, terms: 7)>>
## Non-/sparse entries: 8/27
## Sparsity           : 77%
## Maximal term length: 13
## Weighting          : term frequency (tf)
## 
##           Terms
## Docs       #tree apples hashtag microsoft.com one test two
##   person_1     0      0       0             0   1    1   1
##   person_2     0      1       0             0   0    0   1
##   person_3     0      0       1             0   0    0   0
##   person_4     1      0       0             0   0    0   0
##   person_5     0      0       0             1   0    0   0
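
For reference, `pastex` just collapses its arguments into a single regular expression joined by the | (or) operator, looking up @-prefixed names such as "@rm_url" in the pattern dictionary of qdapRegex (attached along with qdap); a quick way to inspect the combined pattern:

library(qdap)
## "@rm_url" expands to qdapRegex's stock URL-matching regex, which is
## then joined with "#apple\\b" by "|"
pastex("@rm_url", "#apple\\b")
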
Tyler Rinker