
I am referring to a previously asked question: I want to do a sentiment analysis of German tweets and have been using the code below from the Stack Overflow thread I've referred to. However, I would like the analysis to return the actual sentiment scores as a result, not just the sum of TRUE/FALSE values indicating whether a word is positive or negative. Any ideas for an easy way to do this?

You can find the words list also in the previous thread.

library(plyr)
library(stringr)

readAndflattenSentiWS <- function(filename) { 
  words = readLines(filename, encoding="UTF-8")
  words <- sub("\\|[A-Z]+\t[0-9.-]+\t?", ",", words)
  words <- unlist(strsplit(words, ","))
  words <- tolower(words)
  return(words)
}
pos.words <- c(scan("Post3/positive-words.txt",what='character', comment.char=';', quiet=T), 
               readAndflattenSentiWS("Post3/SentiWS_v1.8c_Positive.txt"))
neg.words <- c(scan("Post3/negative-words.txt",what='character', comment.char=';', quiet=T), 
               readAndflattenSentiWS("Post3/SentiWS_v1.8c_Negative.txt"))

score.sentiment = function(sentences, pos.words, neg.words, .progress='none') {
  require(plyr)
  require(stringr)
  scores = laply(sentences, function(sentence, pos.words, neg.words) 
  {
    # clean up sentences with R's regex-driven global substitute, gsub():
    sentence = gsub('[[:punct:]]', '', sentence)
    sentence = gsub('[[:cntrl:]]', '', sentence)
    sentence = gsub('\\d+', '', sentence)
    # and convert to lower case:
    sentence = tolower(sentence)
    # split into words. str_split is in the stringr package
    word.list = str_split(sentence, '\\s+')
    # sometimes a list() is one level of hierarchy too much
    words = unlist(word.list)
    # compare our words to the dictionaries of positive & negative terms
    pos.matches = match(words, pos.words)
    neg.matches = match(words, neg.words)
    # match() returns the position of the matched term or NA
    # I don't just want a TRUE/FALSE! How can I do this?
    pos.matches = !is.na(pos.matches)
    neg.matches = !is.na(neg.matches)
    # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
    score = sum(pos.matches) - sum(neg.matches)
    return(score)
  }, 
  pos.words, neg.words, .progress=.progress )
  scores.df = data.frame(score=scores, text=sentences)
  return(scores.df)
}

sample <- c("ich liebe dich. du bist wunderbar",
            "Ich hasse dich, geh sterben!", 
            "i love you. you are wonderful.",
            "i hate you, die.")
(test.sample <- score.sentiment(sample, 
                                pos.words, 
                                neg.words))
juliasb
  • Does your code run and work? I am guessing `laply` is supposed to be `lapply` but the post you quote also wrote that... – Darren Cook May 16 '14 at 00:44
  • Yes, it runs and works. I actually tried it changing laply to lapply and then it didn't work anymore. I'm still rather new to these functions so I'm not sure why... – juliasb May 16 '14 at 13:19
  • Ah, `laply` is part of plyr! Glad I didn't edit to "fix" that now :-) – Darren Cook May 16 '14 at 23:55

2 Answers


Any ideas for an easy way to do this?

Well, yes, there is. I am doing the same thing with a lot of tweets. If you are really into sentiment analysis you should have a look at the Text Mining (tm) package.

You will see, working from a Document-Term Matrix makes life a lot easier. Yet I have to warn you: having read several journals, bag-of-words methods usually categorize only about 60% of sentiments accurately. If you are really interested in doing high-quality research you should dive into the excellent "Artificial Intelligence: A Modern Approach" by Stuart Russell and Peter Norvig.

... so this is surely not a quick-and-dirty fix-my-sentiments approach. However, two months ago I was at the same point.

However, I would like to do an analysis getting the actual sentiment-scores as a result

As I have been there, you could change your sentiWS to a nice csv file like this (for negative):

NegBegriff  NegWert
Abbau   -0.058
Abbaus  -0.058
Abbaues -0.058
Abbauen -0.058
Abbaue  -0.058
Abbruch -0.0048
...

Then you can import it into R as a nice data.frame. I used this code snippet:

### for all your words in each tweet in a row
for (n in 1:length(words)) {

  ## get the position of the match in your SentiWS data.frame
  tweets.neg.position <- match(unlist(words[n]), neg.words$NegBegriff)
  tweets.pos.position <- match(unlist(words[n]), pos.words$PosBegriff)

  ## now use the positions, to find the matching values and sum 'em up
  score.pos <- sum(pos.words$PosWert[tweets.pos.position], na.rm = T) 
  score.neg <- sum(neg.words$NegWert[tweets.neg.position], na.rm = T)
  score <- score.pos + score.neg

  ## now we have the sentiment for one tweet, push it to the list
  tweets.list.sentiment <- append(tweets.list.sentiment, score)
  ## and go again.
}

## look how beautiful!
summary(tweets.list.sentiment)

### caveat: This code is pretty ugly and not at all good use of R,
### but it works well enough.  I am using the approach from above,
### so I did not need to rewrite the latter.  Up to you ;-)
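The loop above assumes pos.words and neg.words were already read in as data.frames with the columns shown earlier (NegBegriff/NegWert and, analogously, PosBegriff/PosWert). A minimal sketch of that import step, using a couple of demo lines written to a temporary file in place of the real SentiWS export:

```r
# Demo of the import step the loop above assumes: reading the reshaped
# SentiWS table into a data.frame. The two demo rows stand in for the
# real file; in practice you would pass your own file path to read.table().
demo.file <- tempfile(fileext = ".csv")
writeLines(c("NegBegriff\tNegWert",
             "Abbau\t-0.058",
             "Abbruch\t-0.0048"), demo.file)

neg.words <- read.table(demo.file, header = TRUE, sep = "\t",
                        stringsAsFactors = FALSE, fileEncoding = "UTF-8")
neg.words$NegWert[match("abbruch", tolower(neg.words$NegBegriff))]  # -0.0048
```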

Well, I hope it works. (For my example, it did.)

The trick lies in bringing SentiWS into a nice form, which can be achieved with simple text manipulations using Excel macros, GNU Emacs, sed, or whatever else you feel comfortable working with.
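That reshaping can also be done in R itself. Here is a sketch, assuming the raw SentiWS line format implied by the regex in the question (lemma plus POS tag, a tab, the weight, a tab, comma-separated inflected forms); the function name is mine:

```r
# Sketch: reshape raw SentiWS into a word/value data.frame directly in R.
# Assumed line format (from the regex in the question's code):
#   Lemma|POS<TAB>weight<TAB>inflected,forms
readSentiWS <- function(filename) {
  lines  <- readLines(filename, encoding = "UTF-8")
  fields <- strsplit(lines, "\t")
  do.call(rbind, lapply(fields, function(f) {
    lemma <- sub("\\|[A-Z]+$", "", f[1])   # strip the POS tag
    forms <- if (length(f) >= 3) unlist(strsplit(f[3], ",")) else character(0)
    data.frame(word  = tolower(c(lemma, forms)),
               value = as.numeric(f[2]),   # every form gets the lemma's weight
               stringsAsFactors = FALSE)
  }))
}
```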

Dennis Proksch

As a starting point, this line:

words <- sub("\\|[A-Z]+\t[0-9.-]+\t?", ",", words)

is saying "throw away the POS information and the sentiment value" (to just leave you with the word list).

So to do what you want you will need to parse the data differently, and you will need a different data structure. readAndflattenSentiWS currently returns a vector, but you will want it to return a lookup table (from string to number: an env object feels like a good fit, though if I also wanted the POS info then a data.frame starts to feel correct).

After that, most of your main loop can be roughly the same, but you'll need to collect the values, and sum them, rather than just sum the number of positive and negative matches.
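A minimal sketch of that change (the names and toy weights here are illustrative, not actual SentiWS values): a named numeric vector serves as the lookup table, and the scorer sums the matched weights instead of counting TRUE/FALSE:

```r
# Lookup table from word to sentiment weight; an env would also work.
# The weights below are made-up demo values, not SentiWS data.
sent.lookup <- c(wunderbar = 0.5, liebe = 0.25, hasse = -0.5, sterben = -0.25)

score.sentence <- function(sentence, lookup) {
  sentence <- tolower(gsub("[[:punct:]]", "", sentence))
  words <- unlist(strsplit(sentence, "\\s+"))
  # indexing by name gives NA for unmatched words; na.rm drops them
  sum(lookup[words], na.rm = TRUE)
}

score.sentence("ich liebe dich. du bist wunderbar", sent.lookup)  # 0.75
```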

Darren Cook