I am trying to perform sentiment analysis on tweets that have already been fetched and stored in MongoDB. After reading the tweets back as a data frame, I get the following error:

ip.txt=laply(ip.lst,function(t) t$getText())
Error in t$getText : $ operator is invalid for atomic vectors

The entire code is given below:

iphone.tweets <- searchTwitter('#iphone', n=15, lang="en")
iphone.text=laply(iphone.tweets,function(t) t$getText())
df_ip <- as.data.frame(iphone.text)

m <- mongo("iphonecollection",db="project")
m$insert(df_ip)
df_ip<-m$find()
ip.lst<-as.list(t(df_ip))
ip.txt=laply(ip.lst,function(t) t$getText())

What I wish to do is to calculate the sentiment scores as follows:

iphone.scores <- score.sentiment(ip.txt, pos.words,neg.words, .progress='text')

The score.sentiment routine is as follows:

score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
  require(plyr)
  require(stringr)
  # we got a vector of sentences. plyr will handle a list or a vector as an "l" for us
  # we want a simple array of scores back, so we use "l" + "a" + "ply" = laply:
  scores = laply(sentences, function(sentence, pos.words, neg.words) {
    # clean up sentences with R's regex-driven global substitute, gsub():
    sentence = gsub('[[:punct:]]', '', sentence)
    sentence = gsub('[[:cntrl:]]', '', sentence)
    sentence = gsub('\\d+', '', sentence)
    # and convert to lower case:
    sentence = tolower(sentence)
    # split into words. str_split is in the stringr package
    word.list = str_split(sentence, '\\s+')
    # sometimes a list() is one level of hierarchy too much
    words = unlist(word.list)
    # compare our words to the dictionaries of positive & negative terms
    pos.matches = match(words, pos.words)
    neg.matches = match(words, neg.words)
    # match() returns the position of the matched term or NA
    # we just want a TRUE/FALSE:
    pos.matches = !is.na(pos.matches)
    neg.matches = !is.na(neg.matches)
    # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
    score = sum(pos.matches) - sum(neg.matches)
    return(score)
  }, pos.words, neg.words, .progress=.progress )
  scores.df = data.frame(score=scores, text=sentences)
  return(scores.df)
}
SymbolixAU
VBB
  • A few things. Where is your `score.sentiment` routine coming from? What is the point of the mongo db? And why can't you just put the `ip.lst` directly into the `score.sentiment` routine? – Mike Wise Dec 26 '15 at 08:21
  • Instead of fetching the tweets every time, I intend to store them once in MongoDB and then fetch and process them from there. – VBB Dec 29 '15 at 04:11

1 Answer

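I think you wanted to use sapply, which flattens the list of status objects that searchTwitter returns into a character vector. The error itself arises because the values read back from MongoDB are plain text rather than status objects, so there is no getText() method to call on them with $. In any case, the following works. Note that you need to install and then start MongoDB for this to work: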

library(twitteR)
library(plyr)
library(stringr)
library(mongolite)

# you have to set up a Twitter Application at https://dev.twitter.com/ to get API
# credentials, and authenticate with setup_twitter_oauth() before calling searchTwitter()
ntoget <- 600 # get 600 tweets

iphone.tweets <- searchTwitter('#iphone', n=ntoget, lang="en")
iphone.text <- sapply(iphone.tweets,function(t) t$getText())
df_ip <- as.data.frame(iphone.text)

# MongoDB must be installed and the service started (mongod.exe in Windows)
#
m <- mongo("iphonecollection",db="project")
m$insert(df_ip)
df_ip_out<-m$find()

# Following routine (score.sentiment) was copied from:
# http://stackoverflow.com/questions/32395098/r-sentiment-analysis-with-phrases-in-dictionaries
#
score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
  require(plyr)  
  require(stringr)  
  # we got a vector of sentences. plyr will handle a list  
  # or a vector as an "l" for us  
  # we want a simple array ("a") of scores back, so we use  
  # "l" + "a" + "ply" = "laply":  
  scores = laply(sentences, function(sentence, pos.words, neg.words) {
    # clean up sentences with R's regex-driven global substitute, gsub():
    sentence = gsub('[[:punct:]]', '', sentence)
    sentence = gsub('[[:cntrl:]]', '', sentence)
    sentence = gsub('\\d+', '', sentence)    
    # and convert to lower case:    
    sentence = tolower(sentence)    
    # split into words. str_split is in the stringr package    
    word.list = str_split(sentence, '\\s+')    
    # sometimes a list() is one level of hierarchy too much    
    words = unlist(word.list)    
    # compare our words to the dictionaries of positive & negative terms
    pos.matches = match(words, pos.words)
    neg.matches = match(words, neg.words)
    # match() returns the position of the matched term or NA    
    # we just want a TRUE/FALSE:    
    pos.matches = !is.na(pos.matches)   
    neg.matches = !is.na(neg.matches)   
    # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
    score = sum(pos.matches) - sum(neg.matches)    
    return(score)    
  }, pos.words, neg.words, .progress=.progress )  
  scores.df = data.frame(score=scores, text=sentences)  
  return(scores.df)  
}

tweets <- as.character(df_ip_out$iphone.text)
neg = c("bad","prank","inferior","evil","poor","minor")
pos = c("good","great","superior","excellent","positive","super","better")
analysis <- score.sentiment(tweets,pos,neg)
table(analysis$score)

Yields the following (4 scored bad, 592 scored neutral, 4 scored good):

 -1   0   1 
  4 592   4 
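For reference, the original error can be reproduced without Twitter or MongoDB. This is only a minimal sketch, assuming the values read back from the database are plain character strings; the `stored` vector is made-up stand-in data, not real tweets:

```r
# Stand-in for text read back from the database (hypothetical data).
stored <- c("great phone", "bad battery")
ip.lst <- as.list(stored)

# Each element is a character vector, not a twitteR status object,
# so `$getText()` cannot be applied and R raises an error.
err <- tryCatch(
  lapply(ip.lst, function(t) t$getText()),
  error = function(e) conditionMessage(e)
)
print(err)

# The text is already there; just flatten the list back to a vector.
ip.txt <- unlist(ip.lst)
print(ip.txt)
```

Since the stored values are already text, they can be handed to score.sentiment directly, with no getText() step in between.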
Mike Wise
  • Thank you. Could you also tell me what the following line in your code actually does: `tweets <- as.character(df_ip_out$iphone.text)` – VBB Dec 29 '15 at 04:09
  • It converts the `df_ip_out$iphone.text` vector from a factor vector to a character vector. You can see the type of a vector by using the `class()` function. – Mike Wise Dec 29 '15 at 08:30
  • And please mark this as correct, presuming you think it is. – Mike Wise Dec 29 '15 at 08:31
  • Why is `iphone.text` used in `as.character(df_ip_out$iphone.text)`? My aim here is to process only tweets that are fetched from MongoDB. `iphone.text` is obtained from the tweets returned by the searchTwitter function. I want the variable `tweets` to be independent of the tweets fetched initially. It should depend only on the data in MongoDB. – VBB Dec 30 '15 at 04:48
  • I think you are confusing the `df_ip` dataframe, which is built from data retrieved by `searchTwitter`, and the `df_ip_out` dataframe, which is built from data retrieved from the `m$find` mongo retrieval function. – Mike Wise Dec 30 '15 at 10:53
  • Actually, `df_ip` and `df_ip_out` have the same contents, right? – VBB Jan 02 '16 at 15:03
  • Actually, I noticed that `df_ip_out` keeps getting bigger every time you run the script, since the newest `df_ip` data gets appended to the database. I was too lazy to look up the commands to empty the db. – Mike Wise Jan 02 '16 at 15:08
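
To illustrate the factor-versus-character point from the comments, here is a minimal sketch that uses a made-up two-row data frame (`df_demo`, a hypothetical name) in place of the real `m$find()` result. `stringsAsFactors = TRUE` is set explicitly to mimic the older R default; whether the column actually comes back as a factor depends on the R and mongolite versions in use.

```r
# Hypothetical stand-in for the data frame returned by m$find().
df_demo <- data.frame(
  iphone.text = c("great phone", "bad battery"),
  stringsAsFactors = TRUE
)

# Inspect the column type before conversion.
print(class(df_demo$iphone.text))

# as.character() yields plain strings that gsub() and tolower() can work on.
tweets_demo <- as.character(df_demo$iphone.text)
print(class(tweets_demo))
```

On the growing collection: as far as I know, mongolite collection objects expose `drop()` and `remove()`, so calling `m$drop()` before `m$insert(df_ip)` should start each run from an empty collection.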