
I am trying to compute rolling means of an unbalanced data set. To illustrate my point I have produced this toy example of my data:

ID  year  Var   RollingAvg(Var)
1   2000  2     NA
1   2001  3     2
1   2002  4     2.5
1   2003  2     3
2   2001  2     NA
2   2002  5     2
2   2003  4     3.5

The column RollingAvg(Var) is what I want but can't compute. In words, I am looking for the rolling average of ALL the previous observations of Var for each ID. I have tried rollapply from the zoo package and ddply from the plyr package, but I can't see how to set the rolling window length to use ALL the previous observations for each ID. Maybe I should use the plm package instead? Any help is appreciated.

I have seen other posts on rolling means for BALANCED panel data sets, but I can't seem to extrapolate their answers to unbalanced data.
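For completeness, the toy data above can be reproduced with something like this (the data frame name DF is just a choice, matching the column names in the table):

DF <- data.frame(
  ID   = c(1, 1, 1, 1, 2, 2, 2),
  year = c(2000, 2001, 2002, 2003, 2001, 2002, 2003),
  Var  = c(2, 3, 4, 2, 2, 5, 4)
)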

Thanks,

M

Mace
  • I don't understand why is the 5th row of `RollingAvg(Var)` `NA` ? – Jd Baba Apr 19 '13 at 18:34
  • I think it is computing for each `ID` – Metrics Apr 19 '13 at 18:37
  • Is your expected output correct? – Metrics Apr 19 '13 at 18:48
  • @Jdbaba and @user1493368: The `NA`s are there because it is the first observation for that ID, and I want the mean of the PREVIOUS observations, so I would like the first observation of `RollingAvg(Var)` to be `NA`. – Mace Apr 19 '13 at 19:03
  • See also http://stackoverflow.com/questions/19894471/simple-moving-average-on-an-unbalanced-panel-in-r and http://stackoverflow.com/questions/743812/calculating-moving-average-in-r – Michael Ohlrogge Aug 22 '16 at 18:08

2 Answers


Using data.table:

library(data.table)
d = data.table(your_df)  # convert your data frame to a data.table

# cumulative mean of Var within each ID, shifted down one row
# so that each row only averages the PREVIOUS observations
d[, RollingAvg := {avg = cumsum(Var)/seq_len(.N);
                   c(NA, avg[-length(avg)])},
    by = ID]

(or even simplified)

d[, RollingAvg := c(NA, head(cumsum(Var)/(seq_len(.N)), -1)), by = ID]
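A quick usage sketch on the toy data (assuming the data frame DF built as in the question; setorder is used to make sure rows are ordered by year within each ID, as suggested in the comments below):

library(data.table)
d <- data.table(DF)      # DF as constructed in the question
setorder(d, ID, year)    # ensure rows are sorted by time within each ID
d[, RollingAvg := c(NA, head(cumsum(Var)/seq_len(.N), -1)), by = ID]
d                        # first row of each ID is NA, the rest are cumulative means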
eddi
  • you should never use `DT$x = ...` with `data.table`; this copies the whole table, which is precisely what it tries not to do. Use `:=` instead (read the vignette). – statquant Apr 19 '13 at 19:16
  • @eddi: Thanks, that works! Still trying to understand what is going on, but I will probably get there :) Is it possible to extend your answer so that the first, say, 2 observations get coded `NA` instead of only the first? (I know it's not in the original question) – Mace Apr 19 '13 at 19:23
  • sure, what's going on is I compute the cumulative sums and then divide them by the number of observations up to that point, which is really the definition of the mean that you want (run `cumsum` and `seq_len` separately and see what they do); after that I just shift it to the form that you like - if you want to shift it more, just add 2 `NA`'s in front and take out two elements from the tail – eddi Apr 19 '13 at 19:26 (see the sketch after these comments)
  • I also figure that you probably want to double check that your DT object is properly sorted by time before applying this. – Michael Ohlrogge Aug 22 '16 at 18:11
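For reference, a minimal sketch of the two-observation shift eddi describes above (a hypothetical extension, not part of the original answer; the column name RollingAvg2 is arbitrary, and it assumes every ID has at least two observations):

# pad with two NAs and drop the last two cumulative means, so each row
# averages all observations up to two rows earlier within its ID
d[, RollingAvg2 := c(NA, NA, head(cumsum(Var)/seq_len(.N), -2)), by = ID]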

Assuming that years are contiguous within each ID (which is the case in the example data) and DF is the input data frame, here is a solution using just base R. cumRoll is a function that performs the required operation on one ID, and ave then applies it by ID:

cumRoll <- function(x) c(NA, head(cumsum(x) / seq_along(x), -1))  # cumulative mean, lagged one step
DF$Roll <- ave(DF$Var, DF$ID, FUN = cumRoll)                      # apply cumRoll within each ID

The result is:

> DF
  ID year Var Roll
1  1 2000   2   NA
2  1 2001   3  2.0
3  1 2002   4  2.5
4  1 2003   2  3.0
5  2 2001   2   NA
6  2 2002   5  2.0
7  2 2003   4  3.5
G. Grothendieck