I have the code below to clean text for my Twitter sentiment analysis. I want to add another line that removes certain words I don't want to include in the analysis, such as "crap", "sick", etc. Could someone please advise how to do so?

tweets <- searchTwitter("iPhone", n=1500, lang="en")
txt <- sapply(tweets, function(x) x$getText())
txt <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", txt)
txt <- gsub("@\\w+", "", txt)
txt <- gsub("[[:punct:]]", "", txt)
txt <- gsub("[[:digit:]]", "", txt)
txt <- gsub("http\\w+", "", txt)
txt <- gsub("[ \t]{2,}", "", txt)
txt <- gsub("^\\s+|\\s+$", "", txt)
Ryo
  • Ryo.. I guess you might have read the blog: https://mkmanu.wordpress.com/2014/08/05/sentiment-analysis-on-twitter-data-text-analytics-tutorial/ – Manoj Kumar Apr 09 '16 at 05:47
  • You can vectorize `gsub`. Check out [this answer on 'Replace multiple arguments with gsub'](http://stackoverflow.com/a/15254254/3560695). This also simplifies your code. – Therkel Apr 09 '16 at 05:54

1 Answer

With the latest "tm" package in R, you can remove words from a corpus.

library(tm)
# build a corpus from the text column of the tweet data frame
myCorpOld <- Corpus(VectorSource(YourFirstDFonTweet$text))

Please note that in the corpus construction above, "YourFirstDFonTweet" is the data frame that you might have created from the downloaded tweets.

# words to drop from the analysis
unwanted <- c("crap", "sick")

# remove these from the corpus; tm_map applies removeWords to every document
myCorpUpdate <- tm_map(myCorpOld, removeWords, unwanted)
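
If you only have a character vector (like `txt` in the question) rather than a data frame, a minimal sketch could look like the following. The sample tweets and the names `sample_txt`, `unwanted`, and `cleaned_txt` are made up for illustration, and the sketch assumes the "tm" package is installed.

```r
library(tm)

# hypothetical sample data standing in for the cleaned tweets
sample_txt <- c("this phone is crap", "sick of the battery", "love the camera")
unwanted <- c("crap", "sick")

# tm_map expects a corpus, so wrap the character vector first
corp <- VCorpus(VectorSource(sample_txt))
corp <- tm_map(corp, removeWords, unwanted)   # drop the unwanted words
corp <- tm_map(corp, stripWhitespace)         # collapse runs of spaces left behind

# pull the text back out as a plain character vector and trim the ends
cleaned_txt <- trimws(unname(sapply(corp, as.character)))
```

Passing anything other than a corpus (for example a list) as the first argument to `tm_map` will fail, so the `VCorpus(VectorSource(...))` step should not be skipped.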

I hope this gives you an idea of how to resolve your issue.

Manoj Kumar
  • Is there an alternative way to remove those two words using `gsub`? – Ryo Apr 09 '16 at 18:49
  • Using `gsub` with a plain word as the pattern, you remove one word per call. For example, say you have a tweet, `data <- c("This is an example tweet. Here is my crap email : emailaddress@try.com. So many crap things here.")`, and you want to remove the word "crap". `gsub("crap", "", data)` gives you: `"This is an example tweet. Here is my  email : emailaddress@try.com. So many  things here."` – Manoj Kumar Apr 09 '16 at 20:25
  • Thank you so much Manoj! – Ryo Apr 09 '16 at 21:04
  • @Ryo I forgot one thing: removing words with `gsub` can leave extra white space behind. You can use `gsub` again to strip that white space if it affects your sentiment scoring, although it should not. – Manoj Kumar Apr 09 '16 at 22:17
  • I thought `txt <- gsub("[ \t]{2,}", "", txt)` and `txt <- gsub("^\\s+|\\s+$", "", txt)` are for removing spaces? Do I need something else to strip white space? – Ryo Apr 10 '16 at 23:06
  • I've tried your answer, but the error showed: no applicable method for 'tm_map' applied to an object of class "list". Do you have any solutions to this? – Ryo Apr 13 '16 at 03:03
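
For the `gsub` route discussed in the comments above, several words can be removed in a single call by joining them with `|` inside word boundaries. The sketch below is only an illustration: `remove_words` is a hypothetical helper (not part of any package), and it assumes the words contain no regex metacharacters.

```r
# hypothetical helper: remove whole-word matches of `words` from each element of `x`
remove_words <- function(x, words) {
  # build a pattern like "\\b(crap|sick)\\b" so only whole words match
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, ignore.case = TRUE, perl = TRUE)
  # collapse the gaps left behind into a single space, then trim both ends
  x <- gsub("\\s{2,}", " ", x)
  gsub("^\\s+|\\s+$", "", x)
}

cleaned <- remove_words(
  c("my crap phone is sick", "Sickness is not removed", "crap"),
  c("crap", "sick")
)
```

Note that the whitespace step replaces runs of spaces with one space rather than an empty string; replacing with `""` would glue the neighbouring words together.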