1

I am using R. I know calculating moving average is a topic with several answers in this site, but I have some problems that make my question unique.

I have a data frame including 8784 hourly concentrations (366 days * 24 hours) of an air pollutant (Ozone). This data frame includes some NaN values (missing values). The procedure contains following steps:

1- calculating 8-hour moving (rolling) averages of hourly concentrations: i.e. every 8 concentrations should be averaged in this way: average of 1 to 8, average of 2 to 9, average of 3 to 10, etc. This leads to obtaining 24 moving averages for each day (every 24 hours).

2- for each day, I want the maximum of 8-hour moving averages: i.e. among the 24 moving averages, the highest number should be selected. Finally, 366 moving average (366 days) will be selected.

3- A new data frame containing 366 moving averages should be created.

I know there are some packages (openair, zoo, TTR) that do something like this, but are there any ways to write the codes without any packages?

An Exmaple of my data 

     ColName
1    18.76 
2    12.92 
3    8.12 
4    NaN 
5    12.92 
6    3.77 
7    18.76 
8    9.52 
9    94.09 
10    18.76 
11    14.13 
12    8.12 
13    2.04 
14    12.92 
15    9.17 
.
.
.
8783    34.58
8784    64.23 

The name of main data frame is "Hourly". I tried these codes:

Hourly1 <- c(0, cumsum(ifelse(is.nan(Hourly), 0, Hourly))) 
rsum <- (Hourly1[(Hourly1+1):length(Hourly1)] - Hourly1[1:(length(Hourly1) - 8)]) / 8

But when I try the first line, the following error occurs:

Error in is.nan(Hourly) : default method not implemented for type 'list'

UPDATE: I used the following codes, but the maximum of 8-hour averages is not calculated right:

Hourly2<-as.numeric(Hourly$Average)

names(Hourly2) <- rep(seq.Date(as.Date("2017-01-01"), by=1, length.out=366), each=24)

x<-Hourly2
#use cumsum to get the moving average, keep NaNs
cx <- c(0, cumsum(ifelse(is.nan(x), 0, x))) + c(0,x)*0

n <- 8

rsum <- (cx[(n+1):length(cx)] - cx[1:(length(cx) - n)]) / n

res <- data.frame(mov_avg=rsum, days=names(rsum))


#select max from each day, ignoring NaN's
mx <- aggregate(mov_avg~days, data=res, max)

I compared the final results(366 maximum of 8-hour averages, each for 1 day of year) with a standard pre-approved dataset. In some days, the codes calculated averages correctly, but in some other days not! I did not get its logic.

You can find my raw dataset here!

UPDATE 2:

Here is a link to download the final results produced by different methods!

UPDATE3:

The difference between the results was due to the different methods for calculating moving averages. There are three methods for calculating moving averages: left, right, and center. The codes proposed by the guys here follow the "right" method.

Oscar
  • 79
  • 5
  • 1
    There is an answer [here](https://stackoverflow.com/questions/743812/calculating-moving-average) using `cumsum` in base. – Esther Jun 04 '18 at 23:49
  • @Esther I did that, but the following error occurs: Error: (list) object cannot be coerced to type 'double' – Oscar Jun 05 '18 at 00:03
  • Can you add a small sample of your data and the code you tried? Also how do you want the NaN's to be treated? For example do you want to exclude them and take the average of 7 values when they appear? Or to not use the time period around the NaN? – Esther Jun 05 '18 at 00:21
  • @Esther I updated the question. Please see the example of data frame and the codes I tried. I want the missing values to be excluded, but the average of that 8-h section be calculated. – Oscar Jun 05 '18 at 00:49
  • are you looking for the function `rollapply?` ie to roll the means: `zoo::rollapply(1:10,2,means)` this finds the means width=2 – Onyambu Jun 05 '18 at 00:59
  • @Onyambu Yes. Kind of! but I want it be independent from packages. In addition, I want some more actions, as described in number 2 of my question. – Oscar Jun 05 '18 at 01:12
  • If you were to imitate rollapply, then eg `s=1:10;w=2;rollapply(s,w,mean)` is equivalent to `sapply(1:(length(s)-w+1),function(x)mean(s[x:(x+w-1)]))` in base... – Onyambu Jun 05 '18 at 01:21
  • you should do `is.nan(Hourly$ColName)` – Onyambu Jun 05 '18 at 01:23
  • @Onyambu Thanks. What is "s", "w", and "x" in this code? can you define it? – Oscar Jun 05 '18 at 06:37
  • s is the vector you want to perform the rolling means, w is the window of the rolling means.. eg in youe case w=8. x in the argument of the function. so nothing about it – Onyambu Jun 05 '18 at 06:58

2 Answers2

1

Here's an example of how to do it with cumsum when you have missing values. I would be careful to consider how they're distributed in your data and how you want to deal with them though.

#create some sample data
set.seed(1)
x <- rnorm(24*366)
names(x) <- rep(seq.Date(as.Date("2017-01-01"), by=1, length.out=366), each=24)
x[sample(100, 1:length(x))] <- NaN #add some missing values

#use cumsum to get the moving average, keep NaNs
cx <- c(0, cumsum(ifelse(is.nan(x), 0, x))) + c(0,x)*0

n <- 8

rsum <- (cx[(n+1):length(cx)] - cx[1:(length(cx) - n)]) / n

res <- data.frame(mov_avg=rsum, days=names(rsum))

#select max from each day, ignoring NaN's
mx <- aggregate(mov_avg~days, data=res, max)

days   mov_avg
1 2017-01-01 0.6404849
2 2017-01-02 0.3456389
3 2017-01-03 0.5998888
4 2017-01-04 0.6635502
5 2017-01-05 0.7244289
6 2017-01-06 0.1715349
Esther
  • 1,115
  • 1
  • 10
  • 15
  • When I use this code: Hourly1<- c(0, cumsum(ifelse(is.na(Hourly), 0, Hourly))) + c(0,Hourly)*0 I get this error: Error: (list) object cannot be coerced to type 'double' – Oscar Jun 05 '18 at 01:14
  • "Hourly" needs to be a single vector of the measurements. You need to index it with something like `Hourly$ColName` – Esther Jun 05 '18 at 01:18
  • It calculated 8777 of the 8-hour rolling averages. But when I run the code (res <- data.frame(mov_avg=rsum, days=names(rsum))), this error occurs: Error in data.frame(mov_avg = rsum, days = names(rsum)) : arguments imply differing number of rows: 8777, 0 – Oscar Jun 05 '18 at 01:28
  • It's because your vector doesn't have names associated with it like in the example. In order to get the daily maximum, you have some way to tell the function which values go with which days. – Esther Jun 05 '18 at 01:34
  • I run the codes successfully, but there is a problem. When I checked the final results with the output of same dataset in Excel, I realized that the mathematics of codes has mistakes. I am sure about that formula in Excel, could you check the codes again? The part that calculates 8-hour averages is correct. But the part that calculates maximum of 8-hour averages in each day is yielding different numbers. Thank you – Oscar Jun 05 '18 at 12:09
  • @Oscar, I don't think there are any errors with my example data. I'm afraid without seeing your data and code I'm not sure what's causing the discrepancy. Can you show what you used? – Esther Jun 05 '18 at 16:50
  • I tested the codes with my own dataset. I used exactly your codes. I will update the question some minutes later to add codes. I think something is wrong with the dates, and the fact that the first lines calculates 8777 8-hour averages! I think this disturbs the dates. – Oscar Jun 05 '18 at 21:30
  • Thanks Oscar, can you give me an example of one date that produces an incorrect estimate and what the correct value should be? – Esther Jun 05 '18 at 22:09
  • I will update the question some minutes later, and provide a link to download the final results. Just remember that, this file contains the final results (366 maximum 8-hour averages, each for 1 day in year). You will see 3 colums: first, is my results with Excel. Second, is those totally produced by your codes. Third, is those that the first step (calculating just 8-hour averages) is done by your codes, but the second step (calculating the maximum of 8-hour averages) is done by my formula in Excel. Thanks alot. – Oscar Jun 06 '18 at 03:12
  • Dear Esther! Could you plz look at the files. Thank you. – Oscar Jun 07 '18 at 10:27
  • @Oscar, thanks for posting your data, I can now see how the discrepancy was caused. In my code, the first 7 hours are censored because there is not enough information to calculate an 8-hour moving average yet. So the first date should have only 17 average measurements, but the rest have 24. What I believe has happened in your data is that the first 24 averages were still assigned to the first day, giving the last day only had 17 averages, and the maximum taken from those assignments. I believe my data are correct. – Esther Jun 07 '18 at 15:58
  • Thank you Esther. I checked the outputs. Actually both are correct. There are several methods to calculate rolling averages. You have used the "right" method, I have used the "left" one. I have considered the first hour as the starting point, and you have considered the 8th hour as the starting point. Both are correct. Thanks – Oscar Jun 08 '18 at 05:22
0

I've been working on exactly that, and found a solution that uses map2()

# create a day of ozone data  

o3day <- data.frame(o3hrly = runif(24, 0.04, 0.1))

# 8hr average function
avg_8hr <- function(.x, .y, o3) {
  # print(.x)
  # print(.y)
  # print(o3)
  o3 %>% slice(.x:.y) %>% summarize(o38hr = mean(o3hrly))
}

max(unlist(map2(.x = 1:17, .y = 8:24, .f = avg_8hr, o3 = o3day)))
LightonGlass
  • 176
  • 1
  • 13