I'm trying to resample a dataset of hourly ozone measurements from this source: https://aqs.epa.gov/aqsweb/airdata/hourly_44201_2016.zip

Here is the head of the data:

structure(list(date_time = structure(c(1456844400, 1456848000, 
1456851600, 1456855200, 1456858800, 1456862400, 1456866000, 1456869600, 
1456873200, 1456880400, 1456884000, 1456887600, 1456891200, 1456894800, 
1456898400, 1456902000, 1456905600, 1456912800, 1456916400, 1456920000, 
1456923600, 1456927200, 1456930800, 1456934400, 1456938000, 1456941600, 
1456945200, 1456948800, 1456952400, 1456956000), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), Sample.Measurement = c(0.041, 0.041, 
0.042, 0.041, 0.038, 0.038, 0.036, 0.035, 0.029, 0.026, 0.03, 
0.03, 0.028, 0.027, 0.025, 0.023, 0.025, 0.034, 0.036, 0.038, 
0.041, 0.042, 0.043, 0.043, 0.041, 0.033, 0.01, 0.01, 0.011, 
0.007)), .Names = c("date_time", "Sample.Measurement"), row.names = c(NA, 
30L), class = "data.frame")

I've combined the local date and time columns to create a datetime using lubridate:

df$date_time = ymd_hm(paste(df$Date.Local, df$Time.Local))

What I then want to do is resample the Sample.Measurement data into an eight-hour rolling mean. From there I want to then select the max value for each day.

In Pandas, this would be trivial using the resample() method.

How do I do this in R with dplyr?

elksie5000
    Please read [*"How to make a great R reproducible example?"*](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610). Including example data which can easily be copy-pasted, makes it easier for others to help you. – Jaap Jan 20 '18 at 19:23

1 Answer

You could use rollmean from the zoo package together with group_by and summarise from dplyr, as follows. I edited the answer so that you get the maximum for each day and month. If your data covers more than one year, create a year column as well (uncomment the third line in the call to mutate) and then group by day, month and year.

library(zoo)
library(dplyr)
library(lubridate)
df %>% 
 # derive grouping columns and the 8-observation rolling mean
 mutate(day = as.factor(day(date_time)),
        month = as.factor(month(date_time)),
        #year = as.factor(year(date_time)),
        rolling_mean = rollmean(.$Sample.Measurement,
                                k = 8,
                                fill = NA,
                                align = "center")) %>% 
 # take the maximum rolling mean within each day and month
 group_by(day, month) %>% 
 summarise(max_day = max(rolling_mean, na.rm = TRUE)) %>% 
 ungroup()
 # A tibble: 2 x 3
   day   month max_day
   <fct> <fct>   <dbl>
 1 1     3      0.0390
 2 2     3      0.0398

The argument align = "center" is the default, so it is not strictly necessary. I included it so that you notice that your results may depend on it.
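To illustrate how the alignment matters, here is a minimal sketch with align = "right", so that each value averages the current observation and the seven before it, grouped by calendar date rather than by separate day and month columns. The data frame below is made up for illustration (a simple 48-hour ramp, not the EPA data), and the names daily_max and max_8h are hypothetical. Note also that rollmean counts rows, not hours, so if hours are missing (as they appear to be in the sample above), a window of k = 8 spans more than eight hours of clock time.

```r
library(zoo)
library(dplyr)

# Hypothetical hourly series: 48 consecutive hours with invented values
df <- data.frame(
  date_time = seq(as.POSIXct("2016-03-01 00:00", tz = "UTC"),
                  by = "hour", length.out = 48),
  Sample.Measurement = (1:48) / 1000
)

daily_max <- df %>%
  mutate(date = as.Date(date_time),
         # trailing window: current row plus the 7 rows before it
         rolling_mean = rollmean(Sample.Measurement, k = 8,
                                 fill = NA, align = "right")) %>%
  group_by(date) %>%
  summarise(max_8h = max(rolling_mean, na.rm = TRUE)) %>%
  ungroup()

print(daily_max)
```

With align = "right" the first seven rows have no complete window and are filled with NA, which is why na.rm = TRUE is needed in max. Grouping by a single date column avoids having to add a year column when the data spans several years.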

markus