R - Fast way to calculate rolling mean with varying width

Question

I have a dataframe that contains bank assets for several dates (times). Each bank has a unique ID:

# Sample Data
time <- c(51, 52, 53, 55, 56, 51, 52, 51, 52, 53)
id <- c(1234, 1234, 1234, 1234, 1234, 2345, 2345, 3456, 3456, 3456)
name <- c("BANK A", "BANK A", "BANK A", "BANK A", "BANK A", "BANK B", "BANK B", "BANK C", 
          "BANK C", "BANK C")
assets <- c(5000, 6000, 4000, 7000, 8000, 10000, 12000, 30000, 35000, 40000)
df <- data.frame(time, id, name, assets)

> df
   time   id   name assets
1    51 1234 BANK A   5000
2    52 1234 BANK A   6000
3    53 1234 BANK A   4000
4    55 1234 BANK A   7000
5    56 1234 BANK A   8000
6    51 2345 BANK B  10000
7    52 2345 BANK B  12000
8    51 3456 BANK C  30000
9    52 3456 BANK C  35000
10   53 3456 BANK C  40000

For each bank I want to calculate the rolling mean of the assets, varying the width according to the number of consecutive time values. The rolling mean should therefore include all available consecutive previous values of a bank's assets. If no previous value is available for a bank, it should equal assets. To do this, I add a column that counts the number of consecutive time values and then use rollapplyr from the zoo package. This gives me the desired result, but with a large data set it is far too slow:

# Calculate number of consecutive times
require(dplyr)
df <- df %>%
  mutate(number.time = 1) %>% # insert column for number.time, start value = 1
  group_by(id) %>%
  arrange(time) # correct order for moving average

for(i in 2:nrow(df)) # Start loop in second row, end in last row of df
  df$number.time[i] <- 
    ifelse(df$time[i] == df$time[i-1]+1,    # Is time consecutive?
           df$number.time[i - 1] + 1,       # If yes: add 1 to previous number.time
           1)                               # If no: set number.time = 1
# Moving Average
require(zoo)
df %>%
  mutate(mov.average = rollapplyr(data = assets,
                                  width = number.time, # use number.time for width
                                  FUN = mean, 
                                  fill = NA,
                                  na.rm = TRUE))
Source: local data frame [10 x 6]
Groups: id [3]

    time    id   name assets number.time mov.average
   (dbl) (dbl) (fctr)  (dbl)       (dbl)       (dbl)
1     51  1234 BANK A   5000           1        5000
2     52  1234 BANK A   6000           2        5500
3     53  1234 BANK A   4000           3        5000
4     55  1234 BANK A   7000           1        7000
5     56  1234 BANK A   8000           2        7500
6     51  2345 BANK B  10000           1       10000
7     52  2345 BANK B  12000           2       11000
8     51  3456 BANK C  30000           1       30000
9     52  3456 BANK C  35000           2       32500
10    53  3456 BANK C  40000           3       35000

How can I get this output using a faster function? I'm aware of rollmean from zoo as well as SMA from TTR and ma from forecast, but these do not allow for varying width. My question may also be related to this question and this rblog, but I'm not familiar with C++, nor do I know much about writing functions, so I do not really understand those posts.

EDIT 1: Note that in my code above it isn't the for-loop but the rollapplyr that takes a lot of time.

EDIT 2: The rolling mean shall include no more than the last 4 values. That is, as many consecutive values as there are according to the time variable, but at most the last 4. Sorry for the inexact question! :/ My wording was based on the assumption that I would use the "number.time" column, where it would have been easy to cap all values at 4.
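
A minimal base R sketch of the idea described in EDIT 2, not taken from any of the answers: build the run counter without a loop, cap it at 4, and get the varying-width mean from differences of prefix sums. It assumes the rows are sorted by id and then by time, as in the sample data; the names new_run, run, width and cs are made up for this illustration.

```r
time <- c(51, 52, 53, 55, 56, 51, 52, 51, 52, 53)
id <- c(1234, 1234, 1234, 1234, 1234, 2345, 2345, 3456, 3456, 3456)
assets <- c(5000, 6000, 4000, 7000, 8000, 10000, 12000, 30000, 35000, 40000)

# TRUE where a new run starts: first row, change of bank, or gap in time
# (assumption: rows are sorted by id, then time)
new_run <- c(TRUE, diff(id) != 0 | diff(time) != 1)
run <- cumsum(new_run)

# Position within each run: a vectorized counterpart of number.time
number.time <- ave(rep(1, length(time)), run, FUN = cumsum)

# Cap the window width at 4, as requested in EDIT 2
width <- pmin(number.time, 4)

# Prefix sums with a leading 0, so that cs[i + 1] is the sum of assets[1:i]
cs <- c(0, cumsum(assets))
i <- seq_along(assets)

# Sum of the last width[i] values ending at row i, divided by the width
mov.average <- (cs[i + 1] - cs[i + 1 - width]) / width
mov.average
```

On the sample data no run is longer than 3, so the cap never takes effect and the result equals the mov.average column shown above. With very long or very large series, differences of prefix sums can lose a little floating point precision.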

You might need to apply `cumsum(assets) / seq_along(assets)` by (1) `id` and (2) `ave(df$time, df$id, FUN = function(x) cumsum(c(TRUE, (x[-1] - x[-length(x)]) != 1)))` – alexis_laz, May 04 '16 at 12:08
This works perfectly and is very fast, of course. Unfortunately I noticed that my question was inexact: I want to calculate the mean of at most the last 4 values, that is, as many as there are, but no more than the last 4. Do you see any possibility to implement this restriction in your code? My question above was based on the assumption that I would use the column "number.time", so that I could simply limit it to 4. Sorry for that... :/ – jb123, May 04 '16 at 13:37
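
The suggestion in the first comment can be unpacked roughly as follows. This is a sketch of one reading of it, using only base R, and it computes the uncapped running mean (no limit of 4). The variable name run is an assumption for this illustration.

```r
df <- data.frame(
  time = c(51, 52, 53, 55, 56, 51, 52, 51, 52, 53),
  id = c(1234, 1234, 1234, 1234, 1234, 2345, 2345, 3456, 3456, 3456),
  assets = c(5000, 6000, 4000, 7000, 8000, 10000, 12000, 30000, 35000, 40000)
)

# Within each bank, number the runs of consecutive time values:
# the counter goes up by one whenever time is not previous time + 1
run <- ave(df$time, df$id, FUN = function(x) cumsum(c(TRUE, diff(x) != 1)))

# Running mean within each (id, run) combination
df$avg <- ave(df$assets, df$id, run,
              FUN = function(x) cumsum(x) / seq_along(x))
df
```

Here diff(x) stands in for x[-1] - x[-length(x)] from the comment; the two are equivalent for vectors with at least one element.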

G. Grothendieck · Accepted Answer · 2016-05-04T14:32:37.760

First create a grouping variable g and then compute the rolling means. Note that rollsum is substantially faster than rollapply but does not support the partial argument, necessitating the workaround shown:

library(zoo) # rollsum

g <- with(df, cumsum(ave(time, id, FUN = function(x) c(1, diff(x) != 1))))
roll4 <- function(x) rollsum(c(0, 0, 0, x), 4) / pmin(4, seq_along(x)) 
transform(df, avg = ave(assets, g, FUN = roll4))

giving:

   time   id   name assets   avg
1    51 1234 BANK A   5000  5000
2    52 1234 BANK A   6000  5500
3    53 1234 BANK A   4000  5000
4    55 1234 BANK A   7000  7000
5    56 1234 BANK A   8000  7500
6    51 2345 BANK B  10000 10000
7    52 2345 BANK B  12000 11000
8    51 3456 BANK C  30000 30000
9    52 3456 BANK C  35000 32500
10   53 3456 BANK C  40000 35000
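
To see why padding with three zeros works, here is a small base R illustration that mimics the same computation with embed and rowSums instead of rollsum. It is only meant to show the mechanics on a made-up vector x that is longer than 4, not to replace the zoo call above, and it will be slower on large data.

```r
x <- c(1, 2, 3, 4, 5, 6)          # made-up values for one run

# Three leading zeros give every element a full window of 4,
# so the first elements get "partial" sums without a partial argument
padded <- c(0, 0, 0, x)

# embed() returns one row per window of 4 consecutive values
windows <- embed(padded, 4)
sums <- rowSums(windows)           # same values as a rolling sum of width 4

# Divide by the number of real (non-padding) values in each window
counts <- pmin(4, seq_along(x))
sums / counts
# 1.0 1.5 2.0 2.5 3.5 4.5
```

The first three results are means of 1, 2 and 3 values, and from the fourth element on the window is fixed at the last 4 values.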

This works perfectly, but regrettably my question was inexact, as I have since noted in the edit above. In my large dataset I have data for each bank for up to 50 time values, but I only want to include at most the last 4 values. Could this be implemented in your approach, so that it calculates the mean of the last 4 values and, if fewer are available, the mean of those? Sorry for my mistake! – jb123, May 04 '16 at 13:54

user31264 · Answer 2 · 2016-05-04T12:12:46.973

Use cumsum.

If you have just one bank, try:

cumsum(df$assets)/seq(nrow(df))

What to do if you have more than one bank, I leave as an exercise. Hint: you can completely avoid loops by using rle.

Here is the function "cumsum with restarts" which is supposed to help you.

cumsum.r <- function(vals, restart) {
    if (!is.vector(vals) || !is.vector(restart)) stop("expect vectors")
    if (length(vals) != length(restart)) stop("different length")
    # example: assume restart = FFTFFFTFFFFT
    len <- length(vals)                        # 12
    restart[1] <- TRUE                         # TFTFFFTFFFFT
    ind <- which(restart)                      # segment starts: 1,3,7,12
    # repeat each start index over the length of its segment
    ind <- rep(ind, c(ind[-1], len + 1) - ind) # 1,1,3,3,3,3,7,7,7,7,7,12
    vals.c <- cumsum(vals)
    # remove the total accumulated before each segment start
    vals.c - vals.c[ind] + vals[ind]
}
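
A possible way to use this function, sketched under the assumption that the rows are sorted by id and then by time. The first call reproduces the small example given in the comments; the second part turns restarted sums into the uncapped running mean for the sample data. The names restart, sums and counts are made up for this illustration.

```r
# Copy of the function above so that this block runs on its own
cumsum.r <- function(vals, restart) {
    if (!is.vector(vals) || !is.vector(restart)) stop("expect vectors")
    if (length(vals) != length(restart)) stop("different length")
    len <- length(vals)
    restart[1] <- TRUE
    ind <- which(restart)
    ind <- rep(ind, c(ind[-1], len + 1) - ind)
    vals.c <- cumsum(vals)
    vals.c - vals.c[ind] + vals[ind]
}

# Example from the comments: the sum restarts at the fourth element
cumsum.r(c(10, 5, 3, 100, 50), c(FALSE, FALSE, FALSE, TRUE, FALSE))
# 10 15 18 100 150

time <- c(51, 52, 53, 55, 56, 51, 52, 51, 52, 53)
id <- c(1234, 1234, 1234, 1234, 1234, 2345, 2345, 3456, 3456, 3456)
assets <- c(5000, 6000, 4000, 7000, 8000, 10000, 12000, 30000, 35000, 40000)

# Restart at a change of bank or at a gap in time
# (assumption: rows are sorted by id, then time)
restart <- c(TRUE, diff(id) != 0 | diff(time) != 1)

sums <- cumsum.r(assets, restart)                       # restarted running sums
counts <- cumsum.r(rep(1, length(assets)), restart)     # position within each run
sums / counts
```

Dividing the restarted sums by the restarted counts gives the same values as the mov.average column in the question; it does not apply the limit of 4 values from EDIT 2.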

edited May 04 '16 at 12:12

answered May 04 '16 at 08:51


I see how this works in general, which is fine, thanks for the idea. But is it possible to "let the cumsum restart" when there is a break in the time variable? In the sample data for Bank A there is no row for time = 54 (row 3 to 4). And would you mind providing some details on the "exercise" you've left? ;-) – jb123 May 04 '16 at 09:17
Yes, I wrote the function which does the "cumsum with restarts", using only cumsum and rle. Hint: write the function cumsum.r(val, restart) which takes numerical vector val and boolean vector restart, and "restarts" the cumsum at points when restart=TRUE. – user31264 May 04 '16 at 10:55
For instance, if val=c(10,5,3,100,50) and restart=(F,F,F,T,F), the function should return (10,15,18,100,150). – user31264 May 04 '16 at 10:57
Of course, the function should not contain any loops. – user31264 May 04 '16 at 11:01
If you've already written the function, can't you just post it here? I'm still a beginner in R, so far I've written only one function, and I have no idea how to do it in this case. Obviously it would take me a whole lot of time to work it out, and I had actually hoped that posting this question here would save me some time... – jb123 May 04 '16 at 11:48
OK, here it is. It doesn't even need rle (but it needs which and rep). – user31264 May 04 '16 at 12:13
