0

Good morning,

I'm running a block of code and its taking a bit too long. The goal is to make a "moving average" of the last X number of days. In this case, the previous 2, 3, 4, and 5 days of scores. It needs to be from the last date Date, for a unique id EID. Here it the data:

df:
     EID            Date        Score
     1111         5/25/2015        10
     1111         5/29/2015         6
     1111         6/17/2015         9
    12345         5/27/2015        10
    12345          1/1/2015         8
    12345          1/7/2015         9
    12345          1/9/2015        10
    12345         1/10/2015         7
    19611         1/13/2015         8
    19611         1/21/2015        10
    19611         1/23/2015         9
    19611         1/24/2015        10
    19611         1/30/2015         5
    19611          2/5/2015         6
    19611         2/11/2015        10
    19611         2/12/2015         7
    19611         2/14/2015        10
    19611         2/15/2015         6
    19611         2/18/2015        10
    19611         2/19/2015        10

This is what I'm currently running on 500,000+ rows across 6 data frames

uniqueID <- unique(df$EID)
rowNr    <- lapply(uniqueID,function(uniqueID){which(df$EID==uniqueID)})
lastDate <- lapply(rowNr,function(n){df$Date[rev(n)[1]]})
Avg      <- lapply(rowNr,function(n){mean(df$Score[n])})
prev2    <- lapply(rowNr,function(n){mean(df$Score[head(tail(c(NA,n),3),2)])})
prev3    <- lapply(rowNr,function(n){mean(df$Score[head(tail(c(NA,n),4),3)])})
prev4    <- lapply(rowNr,function(n){mean(df$Score[head(tail(c(NA,n),5),4)])})
prev5    <- lapply(rowNr,function(n){mean(df$Score[head(tail(c(NA,n),6),5)])})

Scores <- data.frame(EID       = uniqueID,                       
                     avg_score = unlist(Avg),                       
                     score2    = unlist(prev2),                       
                     score3    = unlist(prev3),                       
                     score4    = unlist(prev4),                       
                     score5    = unlist(prev5))

Here are the results

View(Scores)
EID     avg_score   score2  score3  score4  score5
1111    8.33        7.50    8.33    NA      NA
12345   8.80        8.50    8.67    8.50    8.80
19611   8.42        10.00   8.67    9.00    8.60

Any ideas to make this run faster? Im currently running this on 6 data frames and it takes 10-15 minutes for each one to process. How can this be optimized to run faster?

Thanks!

  • Check out the `data.table` package. It plays nicer with large data than base R. – Richard Erickson Oct 08 '15 at 15:27
  • 1
    Possible dupe: [Calculating moving average in R](http://stackoverflow.com/q/743812/903061)? It's not performance-focused, but I would bet most of the answers are more performant than this one, and a couple of them mention performance (see the `caTools` and `RcppRoll` answers). – Gregor Thomas Oct 08 '15 at 15:29
  • ...though it could definitely do with a `data.table` answer. – Gregor Thomas Oct 08 '15 at 15:31
  • Can't you put `lastDate`, `Avg`, `prev2`, `prev3`, `prev4`, `prev5` in a single `lapply`? They all use `rowNr` for their iterative value and look independent of each other. – mattdevlin Oct 08 '15 at 15:55
  • Looking at the first group `"1111"`, the `"score2"` column has `8.0`. That is the average of `(10+6)/2`. That's the first two days, not the last two. Is that the desired output? Because for the third group you are selecting 6 and 10. That doesn't make sense. How could both of those be the correct moving average for score2? – Pierre L Oct 08 '15 at 16:20
  • Pierre Lafortune, good catch. It should be the last 2. I need to figure out whats wrong and fix. Ultimately, the result should be from the last day of observations for a given EID. – Eddie Fernandez Oct 08 '15 at 16:30
  • I'm still new so my apologies. – Eddie Fernandez Oct 08 '15 at 16:30
  • 2
    This line `aggregate(df$Score,df["EID"],function(x) {y<-rev(x[-length(x)]);length(y)<-5;c(mean(x),cumsum(y)[-1]/2:5)})` reproduces your final result. – nicola Oct 08 '15 at 16:32
  • @nicola, Thank you for the help!!! This is brilliant. I dont understand it, but it does produce the original answers I posted. However, it was brought to my attention that I was calculating those differently than I described (I've updated my solution). I need to include the last score when calculating the mean over the previous 2,3,4,5 day. Additionally, sometimes, I need to change the dates from being sequential (2,3,4,5) to dates like: 5, 10, 15, 20. Any thoughts? – Eddie Fernandez Oct 08 '15 at 20:09
  • The rows for `EID==12345` are not in date order. Are they supposed to be? – jlhoward Oct 08 '15 at 22:57

1 Answers1

3

Here is a data.table solution.

library(data.table)
n      <- c(2,3,4,5)
result <- setDT(df)[,c(mean(Score),lapply(n,function(i){if(.N < i) as.numeric(NA) else mean(tail(Score,i))})),by=EID]
setnames(result,c("EID","avg_score",paste0("score",n)))
result
#      EID avg_score score2   score3 score4 score5
# 1:  1111  8.333333    7.5 8.333333     NA     NA
# 2: 12345  8.800000    8.5 8.666667    8.5    8.8
# 3: 19611  8.416667   10.0 8.666667    9.0    8.6

This reproduces your result, but as pointed out in the comment your rows are not in date order for all the EIDs. If that is important, use this:

setDT(df)[,Date:=as.Date(Date, format="%m/%d/%Y")]
setkey(df,EID,Date)     # ensures that Dates are ascending within EID
n      <- c(2,3,4,5)
result <- df[,c(mean(Score),lapply(n,function(i){if(.N < i) as.numeric(NA) else mean(tail(Score,i))})),by=EID]
setnames(result,c("EID","avg_score",paste0("score",n)))
result
#      EID avg_score score2   score3 score4 score5
# 1:  1111  8.333333    7.5 8.333333     NA     NA
# 2: 12345  8.800000    8.5 9.000000      9    8.8
# 3: 19611  8.416667   10.0 8.666667      9    8.6

If you need rollback averages other than (2,3,4,5), change the definition of n.

jlhoward
  • 58,004
  • 7
  • 97
  • 140