Looking for an easier way to calculate daily and hourly averages in R

Question

I am able to solve my problem successfully using the following code in R. But I am looking if there is a possibility to do it in a simpler way. I am calculating 60min average and daily average. The preprocessed data is stored in 'merged' object. Now there are certain protocols of my task such as (i) Time should be in yyyy-mm-dd hh:mm:ss+00 format (ii) daily average should be computed using 60min average only (and not using 'merged' object).

Code:

#Cal 60min average
avghr <- merged %>%
 select(any_of(cn)) %>%          #cn includes the selected columns in the output  
 as_tibble() %>%
 group_by(Time_sp = lubridate::floor_date(`Date/Time`, "60 mins")) %>% 
 mutate(Time_sp = format(Time_sp, "%Y-%m-%d %H:%M:%S+00")) %>%
 summarise(across(where(is.numeric), ~ if(mean(is.na(.x)) > 0.5) NA else mean(.x, na.rm = TRUE)))
write.csv(avghr, paste0(dirlist[idx],"_hr.csv"), row.names = FALSE)

#60min average to be used for daily average calc
avghr1 <- merged %>%
 select(any_of(cn)) %>%
 as_tibble() %>%
 group_by(Time_sp = lubridate::floor_date(`Date/Time`, "60 mins")) %>% 
 summarise(across(where(is.numeric), ~ if(mean(is.na(.x)) > 0.5) NA else mean(.x, na.rm = TRUE)))
 
#Calc daily average
avgdl <- avghr1 %>%
 group_by(Time_sp = lubridate::floor_date(`Time_sp`, "1 day")) %>%
 mutate(Time_sp = format(Time_sp, "%Y-%m-%d %H:%M:%S+00")) %>%
 summarise(across(where(is.numeric), ~ if(sum(is.na(.x)) > 1) NA else mean(.x, na.rm = TRUE)))
write.csv(avgdl, paste0(dirlist[idx],"_dly.csv"), row.names = FALSE)

To have Time in the specified format, I have used mutate function but mutate function converts Date/Time format to string format. Thus, Time_sp column in avghr is no longer in Date/Time format, hence, is no longer valid for grouping of 1 day and further calculation of daily average. Thus, as a way out, I have calculated 60min average as avghr, used mutate to have Time in the required format and print it as .csv output. Then I again calculated 60min average as avghr1, kept Time_sp in Date/Time format as intact and, therefore successfully used avghr1 to calculate daily average avgdl. Is there any way to avoid calculating avghr1 and still achieve the same results in desired manner. I hope that I have made my problem clear.

Input data

Date/Time C1 [C1c] C2 [C2c] C3 [C3c] C4 [C3c] C5 [h]               
2021-03-01S00:00:00 267 295 250 .....
2021-03-01S00:01:00 .....
2021-03-01S00:02:00 .....
..

dput() of a sample input .txt file

structure(list(`Date/Time` = structure(c(1614556800, 1614556860, 
1614556920, 1614556980, 1614557040, 1614557100), tzone = "UTC", class = 
c("POSIXct", "POSIXt")), `XY [XY]` = c(0.990641, 0.990641, 0.990641, 
0.990641, 0.990641, 0.990641), `C1 [C1]` = c(257, 257, 257, 256, 255, 
255), Cc = c(0, 0, 0, 0, 0, 0), `C2 [C2]` = c(285, 284, 289, 264, 
231, 223), Dc = c(0, 0, 0, 0, 0, 0), `C3 [C3]` = c(255, 255, 
255, 255, 254, 254), Ac = c(0, 0, 0, 0, 0, 0), C4 = c(0.463735, 
0.465678, 0.467612, 0.469561, 0.471472, 0.473374), `C5 [h]` = c(1013, 
NA, NA, NA, NA, NA), `C6 [%]` = c(43, NA, NA, NA, NA, NA), `C7 [E2]` = 
c(390, 390, 393, 380, 365, 361), Jc = c(0, 0, 0, 0, 0, 0), `D [S]` = 
c(62.3716, 62.2459, 62.1206, 61.9942, 61.8701, 61.7465), `Sw [S2]` = 
c(1392.95, 1392.95, 1392.95, 1392.95, 1392.95, 1392.95), `SW [Q2]` = 
c(389.164, 389.253, 392.14, 379.964, 363.91, 360.562), `QA [H2]` = 
c(646.61, 649.313, 652.002, 654.712, 657.371, 660.016), `T2 [C]` = 
c(3.7, NA, NA, NA, NA, NA), Lc = c(0, 0, 0, 0, 0, 0)), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))

Desired output

Hourly output
Time_sp C1 [C1c] C3 [C3c] C5 [C5c]....               
2021-03-01 01:00:00+00 257 285 255 .....
2021-03-01 02:00:00+00  .....
2021-03-01 03:00:00+00  .....
..

Daily output
Time_sp C1 [C1c] C3 [C3c] C5 [C5c]...               
2021-03-01 00:00:00+00 257 285 255 .....
2021-03-02 00:00:00+00  .....
2021-03-03 00:00:00+00  .....
..

R holds dates in the Date format which is always YYYY-MM-DD, and holds datetimes typically in the POSIXct format which is always YYYY-MM-DD HH:MM:SS TZ". If you need to show dates or datetimes in other formats this will by necessity convert them to text. It sounds like a better approach for you would be to keep in one column a raw date/datetime for R calculations, and a "for display" version in another column, e.g. by using `mutate(Time_sp_display = ...` in that first mutate line. — Jon Spring, May 30 '23 at 17:53
You could make your 2nd and 3rd chains into one shorter chain, but it sounds like you can't shorten it that much because you explicitly need two steps of grouped summarization, if you want the daily average to be the average of the observed hourly averages and not just the average of all the day's values. — Jon Spring, May 30 '23 at 17:56
It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. — MrFlick, May 30 '23 at 18:15
Please take another look at the "Providing a minimal dataset" section of the link above, esp the "Why use dput()?" part. https://stackoverflow.com/a/5963610/6851825. This is particularly important in R questions related to data types and formats, as many objects in R *print out* similarly, making it harder for people to understand and help. — Jon Spring, May 30 '23 at 21:01

score 1 · Accepted Answer · answered May 31 '23 at 12:14

Create a function for formatting and writing, let it retun input dataframe inivisibly so it would not affect further processing when used in a pipeline:

library(dplyr)

# write csv with custom datetime format, return input data.frame as-is, ivisibly
write.csv_custom_datetime <- function(df_, filename){
  mutate(df_, Time_sp = format(Time_sp, "%Y-%m-%d %H:%M:%S+00")) %>%
    write.csv(filename, row.names = FALSE)  
  invisible(df_)
}

avghr <- merged %>%
  # select(any_of(cn)) %>%          #cn includes the selected columns in the output  
  as_tibble() %>%
  group_by(Time_sp = lubridate::floor_date(`Date/Time`, "60 mins")) %>% 
  summarise(across(where(is.numeric), ~ if(mean(is.na(.x)) > 0.5) NA else mean(.x, na.rm = TRUE))) %>% 
  # write CSV while keeping POSIXct for avghr
  write.csv_custom_datetime(paste0(dirlist[idx],"_hr.csv"))
# CSV file content:
# readLines(paste0(dirlist[idx],"_hr.csv")) %>% stringr::str_trunc(80) %>% paste0(collapse = "\n") %>% cat()
#> "Time_sp","XY [XY]","C1 [C1]","Cc","C2 [C2]","Dc","C3 [C3]","Ac","C4","C5 [h]...
#> "2021-03-01 00:00:00+00",0.990641,256.166666666667,0,262.666666666667,0,254.6...

#Calc daily average
avgdl <- avghr %>%
  group_by(Time_sp = lubridate::floor_date(`Time_sp`, "1 day")) %>%
  summarise(across(where(is.numeric), ~ if(sum(is.na(.x)) > 1) NA else mean(.x, na.rm = TRUE))) %>% 
  write.csv_custom_datetime(paste0(dirlist[idx],"_dly.csv"))
# CSV file content:
# readLines(paste0(dirlist[idx],"_dly.csv")) %>% stringr::str_trunc(80) %>% paste0(collapse = "\n") %>% cat()
#> "Time_sp","XY [XY]","C1 [C1]","Cc","C2 [C2]","Dc","C3 [C3]","Ac","C4","C7 [E2...
#> "2021-03-01 00:00:00+00",0.990641,256.166666666667,0,262.666666666667,0,254.6...

Prepare reprex:

merged  <- structure(list(`Date/Time` = structure(c(1614556800, 1614556860, 
1614556920, 1614556980, 1614557040, 1614557100), tzone = "UTC", class = 
c("POSIXct", "POSIXt")), `XY [XY]` = c(0.990641, 0.990641, 0.990641, 
0.990641, 0.990641, 0.990641), `C1 [C1]` = c(257, 257, 257, 256, 255, 
255), Cc = c(0, 0, 0, 0, 0, 0), `C2 [C2]` = c(285, 284, 289, 264, 
231, 223), Dc = c(0, 0, 0, 0, 0, 0), `C3 [C3]` = c(255, 255, 
255, 255, 254, 254), Ac = c(0, 0, 0, 0, 0, 0), C4 = c(0.463735, 
0.465678, 0.467612, 0.469561, 0.471472, 0.473374), `C5 [h]` = c(1013, 
NA, NA, NA, NA, NA), `C6 [%]` = c(43, NA, NA, NA, NA, NA), `C7 [E2]` = 
c(390, 390, 393, 380, 365, 361), Jc = c(0, 0, 0, 0, 0, 0), `D [S]` = 
c(62.3716, 62.2459, 62.1206, 61.9942, 61.8701, 61.7465), `Sw [S2]` = 
c(1392.95, 1392.95, 1392.95, 1392.95, 1392.95, 1392.95), `SW [Q2]` = 
c(389.164, 389.253, 392.14, 379.964, 363.91, 360.562), `QA [H2]` = 
c(646.61, 649.313, 652.002, 654.712, 657.371, 660.016), `T2 [C]` = 
c(3.7, NA, NA, NA, NA, NA), Lc = c(0, 0, 0, 0, 0, 0)), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))

dirlist <- c("tmp")
idx <- 1

^{Created on 2023-05-31 with reprex v2.0.2}

Looking for an easier way to calculate daily and hourly averages in R

1 Answers1

Linked