
I have a dataframe (myDF) with 700,000+ rows; each row has two columns, id and text. The text column contains 140-character texts (tweets), and I would like to run a sentiment analysis script that I got off the web on them. However, no matter what I try, I run into memory problems on a MacBook with 4 GB of RAM.

I was thinking that maybe I could loop through the rows in batches, e.g. do the first 10, then the next 10, and so on (I run into problems even with batches of 100). Would this solve the problem? What is the best way to loop in such a way?

I am posting my code here:

library(plyr)
library(stringr)

# function score.sentiment
score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
   # Parameters
   # sentences: vector of text to score
   # pos.words: vector of words of positive sentiment
   # neg.words: vector of words of negative sentiment
   # .progress: passed to laply() to control the progress bar

   # create simple array of scores with laply
   scores = laply(sentences,
   function(sentence, pos.words, neg.words)
   {

      # split sentence into words with str_split (stringr package)
      word.list = str_split(sentence, "\\s+")
      words = unlist(word.list)

      # compare words to the dictionaries of positive & negative terms
      pos.matches = match(words, pos.words)
      neg.matches = match(words, neg.words)

      # get the position of the matched term or NA
      # we just want a TRUE/FALSE
      pos.matches = !is.na(pos.matches)
      neg.matches = !is.na(neg.matches)

      # final score
      score = sum(pos.matches) - sum(neg.matches)
      return(score)
   }, pos.words, neg.words, .progress=.progress)

   # data frame with scores for each sentence
   scores.df = data.frame(text=sentences, score=scores)
   return(scores.df)
}

# import positive and negative words
pos = readLines("positive_words.txt")
neg = readLines("negative_words.txt")

# apply function score.sentiment


myDF$scores = score.sentiment(myDF$text, pos, neg, .progress='text')$score
d12n
    Don't post your *entire* code!? It's too much to go through, instead focus on creating a [***small reproducible example***](http://stackoverflow.com/q/5963269/1478381). Much more likely to get help that way. – Simon O'Hanlon Apr 28 '13 at 20:35
  • Thanks! I made it shorter, removed the text manipulation parts (since it causes the same problem without those parts as well) and I don't think I could give a sample data frame, since my problem is when I have more than a few dozen rows, so any problematic data frame I can provide would be way too long. – d12n Apr 28 '13 at 20:40
    This is still not reproducible. Please read the linked post carefully, and pay particular attention to the use of `dput( head( pos ) )`. It will really help. – Simon O'Hanlon Apr 28 '13 at 21:20
  • I know it isn't reproducible, but as I said, if I wanted to reproduce the problem, I would have to dput not only the head but at least 500 rows of the dataframe -since it works fine otherwise-, which would make the post quite illegible. I can do it, if you think it would help. But what I am looking for is whether anybody else has run into a similar memory problem (not necessarily through this code), and what kind of a loop would be best to use in order to go through the rows batch by batch. I know that for loops don't work very well in R, so I am looking for other alternatives. – d12n Apr 28 '13 at 21:25
    (1) "for loops don't work well in R" is a myth, (2) with vague, non-reproducible out-of-memory questions like this, it's really impossible to be much help. Try using **data.table** is about as helpful as anyone will be able to be. – joran Apr 28 '13 at 21:28
  • You don't need to post it all, just a small set. We can run different solutions on a small subset to compare their relative speed against your baseline code. – Simon O'Hanlon Apr 28 '13 at 21:28
  • I understand and I am trying, but things have gone crazy over here. dput(head(df)) wants to dput the whole dataframe, and when I make a new dataframe out of just the head and dput that, the same thing happens. This is very strange. I will keep on trying though! – d12n Apr 28 '13 at 22:15
  • likely your data frame treats columns as factors, and dput is printing the factor levels; create the data frame using the argument `stringsAsFactors=FALSE`; this might help substantially with your overall memory use. – Martin Morgan Apr 28 '13 at 23:24
  • @joran's suggestion is probably your best bet. At the very least, separate `DF$Scores` from `DF$Texts`. Instead of having them in one `data.frame`, put each into **its own** separate `list`. – Ricardo Saporta Apr 29 '13 at 00:35
  • It is not done yet, it is still running, but what @MartinMorgan said was probably true: the problem was that the text was seen as factors. stringsAsFactors didn't work, so I did an as.character conversion and checked it with str(); it did convert to characters, and I ran the script again and it did not freeze, so that was probably the problem. And even though it is using 100.2% of CPU at times, it is using way less memory this way. I will update once it is done! Thank you so much. You might want to post it as an answer, so others with the same problem can see it too (there are many others afaik). – d12n Apr 29 '13 at 14:39

3 Answers


4 GB sounds like enough memory for 700,000 140-character sentences. A different way to calculate your sentiment scores might be more memory- and time-efficient, and/or easier to break into chunks. Instead of processing each sentence separately, break the entire group of sentences into words

words <- str_split(sentences, "\\s+")

then determine how many words you have in each sentence, and create a single vector of words

len <- sapply(words, length)
words <- unlist(words, use.names=FALSE)

By re-using the words variable I free up the previously used memory for recycling (no need to explicitly call the garbage collector, contrary to the advice in @cryo111's answer!). You can find out whether a word is in pos.words or not, without worrying about NAs, with words %in% pos.words. But we can be a bit clever: calculate the cumulative sum of this logical vector, and then subset that cumulative sum at the position of the last word in each sentence, i.e., at cumsum(len)

cumsum(words %in% pos.words)[cumsum(len)]

and recover the number of positive words in each sentence as the difference of successive values

pos.match <- diff(c(0, cumsum(words %in% pos.words)[cumsum(len)]))

This is the pos.match portion of your score. So

scores <- diff(c(0, cumsum(words %in% pos.words)[cumsum(len)])) - 
          diff(c(0, cumsum(words %in% neg.words)[cumsum(len)]))

and that's it.

score_sentiment <-
    function(sentences, pos.words, neg.words)
{
    words <- str_split(sentences, "\\s+")
    len <- sapply(words, length)
    words <- unlist(words, use.names=FALSE)
    last <- cumsum(len)    # position of the last word of each sentence
    diff(c(0, cumsum(words %in% pos.words)[last])) - 
      diff(c(0, cumsum(words %in% neg.words)[last]))
}
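
A tiny sanity check of score_sentiment on made-up data (the word lists and sentences below are purely hypothetical, just to show the shape of the result):

library(stringr)

pos <- c("good", "great")      # hypothetical positive words
neg <- c("bad", "awful")       # hypothetical negative words
sentences <- c("good good bad", "awful day", "nothing here")

score_sentiment(sentences, pos, neg)
# [1]  1 -1  0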

The intention here is that this processes all your sentences in a single pass

myDF$scores <- score_sentiment(myDF$text, pos, neg)

This avoids for loops which, while not inherently inefficient compared to lapply and friends when implemented correctly (as @joran points out), are much less efficient than vectorized solutions. sentences probably doesn't get copied here, and returning just the scores avoids wasting memory on information (the sentences) we already have. The biggest memory users will be sentences and words.

If memory is still a problem, then I'd create an index that can be used to split the text into smaller groups, and calculate the score of each

nGroups <- 10 ## i.e., about 70k sentences / group
idx <- seq_along(myDF$text)
grp <- split(idx, cut(idx, nGroups, labels=FALSE))
scorel <- lapply(grp, function(i) score_sentiment(myDF$text[i], pos, neg))
myDF$scores <- unlist(scorel, use.names=FALSE)

making sure first that myDF$text is in fact a character vector, e.g., myDF$text <- as.character(myDF$text).
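
A short sketch of that check, using the column names from the question:

str(myDF$text)                      # "Factor w/ ... levels" means it still needs converting
if (is.factor(myDF$text))
    myDF$text <- as.character(myDF$text)
# building the data frame with data.frame(..., stringsAsFactors=FALSE) avoids the issue entirely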

Martin Morgan
  • Wow, I can't believe this! After spending about 8 hours on this, trying for loops for a few hours (I managed to get through 11,000 rows in 4 hours), your solution just worked! It finished in about 20 minutes, this is incredible. I had no idea good design made SUCH a huge difference, I am just in awe right now. I really hope I will be able to do this on my own one day, since I still don't exactly know why your solution was so much faster. – d12n Apr 29 '13 at 21:53
  • Awesome answer. Maybe there should be one more line of code such as len=cumsum(len) ? – wind Sep 08 '13 at 15:58

I think it's rather hard to give a definitive answer to your problem, but here are a few pointers. What helped for me was frequent use of the garbage collector, gc(), as well as removing objects that are no longer needed from memory with rm(obj_name). You could also consider transferring your data into a database such as MySQL. This is rather easy if you export your dataframe as a CSV and use LOAD DATA INFILE ... Then it should be possible to loop through much larger chunks than 100 rows (the RODBC package is a good tool to access SQL databases from R). Another alternative would be to hold your data in an external file and read the data block-wise, but I do not know how this can be done effectively in R. It's also useful to keep an eye on memory use (on Windows: Task Manager > Performance > Resource Monitor > Memory; Activity Monitor on a Mac).
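
For the "read the data block-wise" idea, here is a minimal base-R sketch, assuming the data frame has first been exported with write.csv(myDF, "tweets.csv", row.names=FALSE); the file name and chunk size are made up:

con <- file("tweets.csv", open = "r")
invisible(readLines(con, n = 1))              # skip the header line
chunk_size <- 10000
repeat {
  chunk <- tryCatch(
    read.csv(con, header = FALSE, nrows = chunk_size,
             col.names = c("id", "text"), stringsAsFactors = FALSE),
    error = function(e) NULL)                 # read.csv errors once the file is exhausted
  if (is.null(chunk) || nrow(chunk) == 0) break
  # process this block, e.g. score.sentiment(chunk$text, pos, neg)
}
close(con)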

BTW: as far as I have read, a single Twitter message can be at most 560 bytes. 700k entries therefore amount to roughly 400 MB of data. Although this is a rather large amount of data, it should be no problem for 4 GB of RAM. Do you have some other data in memory? Do you have other programs running?
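
A quick back-of-the-envelope check of that estimate:

700000 * 560 / 1e6      # ~392 MB if every tweet hit the 560-byte maximum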

cryo111

If I understand correctly, you want to apply a function to sets of ten lines using a loop. Here's a generic way to do that. I first create a list with sets of ten lines using split. They are not ordered, but that should not matter, as you can reorder at the end if you want. You then apply your function in a loop and append the results to an out object using rbind.

x <- matrix(1:100, ncol = 1)
parts.start <- split(1:100, 1:10)  # creates list: divide into 10 sets of 10 lines

out <- NULL
for (i in 1:length(parts.start)) {
  res <- x[parts.start[[i]], , drop = FALSE] * 2  # your function applied to elements of the list
  out <- rbind(out, res)
}
head(out)

     [,1]
[1,]    2
[2,]   22
[3,]   42
[4,]   62
[5,]   82
[6,]  102
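
If the number of chunks gets large, growing out with rbind() on every iteration copies it repeatedly; a sketch of the same loop that collects the pieces in a list and binds them once at the end (same x and parts.start as above):

res.list <- vector("list", length(parts.start))
for (i in seq_along(parts.start)) {
  res.list[[i]] <- x[parts.start[[i]], , drop = FALSE] * 2  # your function applied to each chunk
}
out <- do.call(rbind, res.list)
head(out)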
Pierre Lapointe