2

I have a dataframe showing a date, an item and a value, and I want to add a column showing the average of its 50 previous entries (or NA if it hasn't had 50) e.g. the table could be

      data
date     item value  
01/01/01 a    2  
01/01/01 b    1.5  
04/01/01 c    1.7  
05/01/01 a    1.9  
......

and part of it could become

date     item value last_50_mean   
........ 
11/09/01 a    1.2   1.1638
12/09/01 b    1.9   1.5843 
12/09/01 a    1.4   1.1621
13/09/01 c    0.9   NA
........

So in this case the mean of a in the 50 entries before 11/09/01 is 1.1638 and c hasn't had 50 entries before 13/09/01 so returns NA

I am currently doing this using the following function

  data[, 'last_50_mean'] <- sapply(1:nrow(data), function(i){
        prevDates <- data[data$date < data$date[i] & data$item == data$item[i], ]
        num       <- nrow(prevGames)
        if(nGames >= 50){
          round(mean(prevDates[(num- 49):num, ]$value), 4)
        }
      }
  )

But my dataframe is large and it is taking a long time (in fact I'm not 100% sure it works as it is still running... Does anyone know of the best way to do this?

BenMorel
  • 34,448
  • 50
  • 182
  • 322
user1165199
  • 6,351
  • 13
  • 44
  • 60
  • `library(xts); ?rollmean`. Also, see [here](http://r.789695.n4.nabble.com/Calculating-rolling-mean-by-group-td4280511.html) for a related question. Finally, please review [this question](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) to make your question reproducible so others can easily try and help you. – Chase May 23 '12 at 16:01
  • Did you say your dataframe was large? Try [this recent answer](http://stackoverflow.com/questions/10721180/how-do-i-speed-up-this-for-loop-with-data-table-lapply). – Matt Dowle May 23 '12 at 16:45

1 Answers1

4

The mean of N observations can be calculated from the cumulative sum and the difference between the first and last value, diff(cumsum(x), lag=N - 1). Your question wants the first N - 1 values to be padded, so

meanN <- function(x, N=50)
    ## mean of last N observations, padded in front with NA
{
    x0 <- x[seq_len(length(x) - N + 1)]
    x1 <- (x0 + diff(cumsum(x), lag=N-1)) / N
    c(rep(NA, N - 1), x1)
}

You'd like to do this for several groups. For a data.frame like

df <- data.frame(item=sample(letters[1:3], 1000, TRUE),
                 value=runif(1000, 1, 3),
                 last_50_mean=NA)

one way of doing this is

split(df$last_50_mean, df$item) <- lapply(split(df$value, df$item), meanN)

leading to for instance

> tail(df)
     item    value last_50_mean
995     c 1.191486     2.037707
996     c 2.899214     2.073022
997     c 2.019375     2.054914
998     c 2.737043     2.066389
999     a 1.703752     1.923234
1000    c 1.602442     2.043517

This assumes that your data frame is ordered by time. A potential problem is when long vectors overflow cumsum; one could address this by centering value so the expectation is that cumsum doesn't stray too far from zero. A recent question addressed alternatives to split<- and dropping the last N observations.

Community
  • 1
  • 1
Martin Morgan
  • 45,935
  • 7
  • 84
  • 112
  • Many thanks Martin, this does the job. Only slight thing was that the mean included the current value which I didn't want it to do so I changed the end of the meanN function to c(rep(NA, N), x1[1:(length(x1) - 1)]) to fix this – user1165199 May 24 '12 at 12:50