Calculate centered moving average of dataset, ignoring NAs, output same length as input?

Question

Ultimately I want to create a 15 hour moving average of my dataset. The data I am working with has a date and time for every 15 minutes. I need the window centered (so 30 time steps before/after the row I'm looking at). I also need to be able to link the moving average data back in with my datetime data, so I need it to be the same length as my original dataframe.

I have tried this using a number of functions (movag, MAVE, rollmean, ...), but have not been able to find the correct combination to do both a centered average and pad the result with the appropriate number of NA results.

Expend your question with an example of your data and some code you tried, preferably with example what expected result should look like. Btw why are you not able to "pad" result manually with something like `c(rep(NA, 30), vector_of_ma, ...` — jyr, Jan 17 '20 at 21:57

Allan Cameron · Accepted Answer · 2020-01-18T12:28:57.250

You can define a function that calculates a moving average with any window size you choose, and by default fills either end with NA if you only want the rolling average to include a full window's worth of values. If you want it to average the values within a partial window, you can specify that too.

It will handle any NAsin the input, and if all the inputs in a particular window are NA it will return an NA in the appropriate point of the output vector.

This function could be a lot shorter, but I have written it this way for clarity, and have included basic error checking and explanatory comments.

moving_average <- function(vec, window, full.window.only = TRUE)
{
  # Define the size of the window on either side
  half_window <- window %/% 2

  # Ensure the vector is long enough to have at least one window
  stopifnot(length(vec) > window) 

  # Get the indices we want to average
  indices <- lapply(seq_along(vec), 
                    function(y) {
                      z <- y + 0:(2 * half_window) - half_window;
                      z[z > 0 & z <= length(vec)]})

  # Get the rolling mean at each of our indices, handling NAs as we go
  result <- sapply(indices, 
                   function(x){
                     if(all(is.na(vec[x]))) return(NA)
                     else return(mean(vec[x], na.rm = TRUE))})

  # Insert NAs if we don't want partial means at either end
  if(full.window.only) 
  {
    result[1:half_window] <- NA
    result[(length(vec) - half_window + 1):length(vec)] <- NA
  }

  return(result)
}

I'll show an example here, by trying to recreate some sample data from your description:

set.seed(1) # Ensures the random numbers are reproducible

df <- data.frame(times  = as.POSIXct("2019-12-25 09:00:00") + 1:20 * 900,
                 values = rnorm(20, 20, 4))

The data frame looks like this:

                 times   values
1  2019-12-25 09:20:00 17.49418
2  2019-12-25 09:40:00 20.73457
3  2019-12-25 10:00:00 16.65749
4  2019-12-25 10:20:00 26.38112
5  2019-12-25 10:40:00 21.31803
6  2019-12-25 11:00:00 16.71813
7  2019-12-25 11:20:00 21.94972
8  2019-12-25 11:40:00 22.95330
9  2019-12-25 12:00:00 22.30313
10 2019-12-25 12:20:00 18.77845
11 2019-12-25 12:40:00 26.04712
12 2019-12-25 13:00:00 21.55937
13 2019-12-25 13:20:00 17.51504
14 2019-12-25 13:40:00 11.14120
15 2019-12-25 14:00:00 24.49972
16 2019-12-25 14:20:00 19.82027
17 2019-12-25 14:40:00 19.93524
18 2019-12-25 15:00:00 23.77534
19 2019-12-25 15:20:00 23.28488
20 2019-12-25 15:40:00 22.37561

For the purposes of this example, I'll set the window size to 5 (the value and the two measurements before and after it). You'll want to set yours to 30 (or maybe 60, I wasn't sure from your question). All I have to do is:

df$rolling_average <- moving_average(df$values, 5)

Now df looks like this:

                 times   values rolling_average
1  2019-12-25 09:15:00 17.49418              NA
2  2019-12-25 09:30:00 20.73457              NA
3  2019-12-25 09:45:00 16.65749        20.51708
4  2019-12-25 10:00:00 26.38112        20.36187
5  2019-12-25 10:15:00 21.31803        20.60490
6  2019-12-25 10:30:00 16.71813        21.86406
7  2019-12-25 10:45:00 21.94972        21.04846
8  2019-12-25 11:00:00 22.95330        20.54054
9  2019-12-25 11:15:00 22.30313        22.40634
10 2019-12-25 11:30:00 18.77845        22.32827
11 2019-12-25 11:45:00 26.04712        21.24062
12 2019-12-25 12:00:00 21.55937        19.00824
13 2019-12-25 12:15:00 17.51504        20.15249
14 2019-12-25 12:30:00 11.14120        18.90712
15 2019-12-25 12:45:00 24.49972        18.58229
16 2019-12-25 13:00:00 19.82027        19.83435
17 2019-12-25 13:15:00 19.93524        22.26309
18 2019-12-25 13:30:00 23.77534        21.83827
19 2019-12-25 13:45:00 23.28488              NA
20 2019-12-25 14:00:00 22.37561              NA

And to check the results visually, let's plot the rolling average as a line over the points:

plot(df$times, df$values, xlab = "Time", ylab = "Value", main = "Moving average")
lines(df$times, df$rolling_average, col = "red")

Which looks like this:

Calculate centered moving average of dataset, ignoring NAs, output same length as input?

1 Answers1