Iterate through column names to get different type of functions summarized by week in r dataframe using dplyr

Question

I am trying to iterate through global health epidemic data on a database which consists of daily cases, cumulative cases, daily deaths, and cumulative deaths (as well as some other covariables which aren't really relevant here). The table is structured as follows: For each country (with country name listed, region, ID) and each date (though not all dates are displayed for all countries*) the daily/cumulative cases/deaths/etc. are listed.

The data looks something like this:

# A tibble: 40 x 7
   iso_code continent location    date       total_cases new_cases week   
   <chr>    <chr>     <chr>       <date>           <dbl>     <dbl> <chr>  
 1 AFG      Asia      Afghanistan 2020-02-24           5         5 2020-08
 2 AFG      Asia      Afghanistan 2020-02-25           5         0 2020-08
 3 AFG      Asia      Afghanistan 2020-02-26           5         0 2020-08
 4 AFG      Asia      Afghanistan 2020-02-27           5         0 2020-08
 5 AFG      Asia      Afghanistan 2020-02-28           5         0 2020-08
 6 AFG      Asia      Afghanistan 2020-02-29           5         0 2020-08
 7 AFG      Asia      Afghanistan 2020-03-01           5         0 2020-09
 8 AFG      Asia      Afghanistan 2020-03-02           5         0 2020-09
 9 AFG      Asia      Afghanistan 2020-03-03           5         0 2020-09
10 AFG      Asia      Afghanistan 2020-03-04           5         0 2020-09
# ... with 30 more rows

I need to summarize the daily data into weekly data. Of course, this is no problem for one column: using methods described here I should be able to aggregate the data for each week, for each country as follows~

library(dplyr)
sumByColumn <- function(df, colName) {
# the method for daily (cases/deaths)/(cases/deaths) smoothed
  df %>%
    group_by(location, week) %>%
    summarize(colName = sum(!! sym(colName)))
}
idByColumn <- function(df, colName) {
# the method for cumulative (cases/deaths)
  df %>%
    group_by(location, week) %>%
    summarize(colName = identity(!! sym(colName)))
}

(It should be noted that, obviously, daily case/death data will be summarized, whereas cumulative case/death data will be simply the identity function as given. These columns, in the list of column names of df, are denoted as id_cols.)

However, when I try to run the sumByColumn()/idByColumn() loop along the entire dataframe df, I run into this error:

for (col in 1:ncol(df)) {
  colName = colnames(df)[col]
  if (col%in%id_cols) {
    df_weekly = idByColumn(df_weekly,colName)
  } else {
    df_weekly = sumByColumn(df_weekly,colName)
  }
}

I get:

Error in !sym(colName) : invalid argument type

Note: I have computed the frequency by which the number of times each country appears in the dataframe, which corresponds to the number of days the disease was tracked. Is there a way to account for this, e.g. when I go through the weeks, if there is no data for that week, or an uneven number of countries per week give data, to ignore it and not return NA?

Edit:

R Session Info is:

R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_(Country).1252  LC_CTYPE=English_(Country).1252    LC_MONETARY=English_(Country).1252
[4] LC_NUMERIC=C                    LC_TIME=English_(Country).1252    
system code page: 65001

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] nnet_7.3-17           plyr_1.8.6            car_3.0-12            carData_3.0-5        
 [5] nlme_3.1-153          lubridate_1.8.0       gridExtra_2.3         ExcelFunctionsR_0.1.4
 [9] forcats_0.5.1         stringr_1.4.0         purrr_0.3.4           readr_2.1.2          
[13] tidyr_1.2.0           tibble_3.1.6          ggplot2_3.3.5         tidyverse_1.3.1      
[17] dplyr_1.0.7           readxl_1.3.1          poisson_1.0          

loaded via a namespace (and not attached):
 [1] tseries_0.10-49   httr_1.4.2        jsonlite_1.7.3    splines_4.1.2     modelr_0.1.8     
 [6] assertthat_0.2.1  TTR_0.24.3        sp_1.4-6          roperators_1.2.0  cellranger_1.1.0 
[11] pillar_1.7.0      backports_1.4.1   lattice_0.20-45   glue_1.6.1        quadprog_1.5-8   
[16] digest_0.6.29     rvest_1.0.2       colorspace_2.0-2  Matrix_1.3-4      timeDate_3043.102
[21] pkgconfig_2.0.3   broom_0.7.12      haven_2.4.3       scales_1.1.1      tzdb_0.2.0       
[26] mgcv_1.8-38       generics_0.1.2    farver_2.1.0      ellipsis_0.3.2    withr_2.5.0      
[31] urca_1.3-0        cli_3.1.1         quantmod_0.4.18   magrittr_2.0.2    crayon_1.5.0     
[36] forecast_8.16     fs_1.5.2          fansi_1.0.2       xts_0.12.1        xml2_1.3.3       
[41] tools_4.1.2       hms_1.1.1         lifecycle_1.0.1   munsell_0.5.0     reprex_2.0.1     
[46] compiler_4.1.2    rlang_1.0.1       grid_4.1.2        rstudioapi_0.13   INLA_21.11.22    
[51] labeling_0.4.2    gtable_0.3.0      fracdiff_1.5-1    abind_1.4-5       DBI_1.1.2        
[56] curl_4.3.2        R6_2.5.1          zoo_1.8-9         utf8_1.2.2        stringi_1.7.6    
[61] parallel_4.1.2    Rcpp_1.0.8        vctrs_0.3.8       dbplyr_2.1.1      tidyselect_1.1.2 
[66] lmtest_0.9-39

You can use `across()` without any helper functions or loops. For example, `df_weekly = df %>% group_by(location, week) %>% summarize(across(all_of(columns_to_sum), sum))` where `columns_to_sum` is the names of the columns you want to sum. I'm not really sure what you're trying to accomplish with the `summarize(identity())` for the ID columns... if you're not aggregating it just seems like a long way to write `df %>% select(location, week, all_of(id_cols))`. — Gregor Thomas, Sep 15 '22 at 02:11
Your attempt is not working because `!!` is used when passing unquoted symbols, e.g., it would work if you did `sumByColumn(df_weekly, total_cases)`. You are not doing, you are passing a string column name. For that, we need to use `.data[[colName]]` inside the function. See Ben Bolker's answer [at the related FAQ](https://stackoverflow.com/a/56830842/903061) or the [Programming with dplyr vignette](https://dplyr.tidyverse.org/articles/programming.html) (see the section *"When you have an env-variable that is a character vector..."*) — Gregor Thomas, Sep 15 '22 at 02:18
(Though also your `for` loop won't because every calculation overwrites the same `df_weekly` object. Only the results from the last iteration will be available.) — Gregor Thomas, Sep 15 '22 at 02:20
If you need a more explicit answer, please make your example a bit more reproducible. Your sample data doesn't have a `week` column which your code uses, and it seems to be all a single location in a single week, so it doesn't illustrate the problem very well. 4 rows total in 2 different weeks would be better. And if you want to try to address the **Note** make sure there is sample data to illustrate that part of the problem too. And it would be nice if you'd provide the definition of `id_cols` in code, not just a description. Showing the desired results for the sample input is always nice. — Gregor Thomas, Sep 15 '22 at 02:23
I found Gregor's approach the most useful! But unfortunately I am getting this error after passing his function: ``` Error in `context_peek()`: ! `across()` must only be used inside dplyr verbs. ``` — therickster, Sep 15 '22 at 19:41
As I said, if you need a more explicit answer, please make your data more reproducible. Also make sure you didn't load `plyr` **after** `dplyr` and ignored the warning about the plyr version of summarize masking the `dplyr` version [as in this FAQ](https://stackoverflow.com/q/26106146/903061). — Gregor Thomas, Sep 15 '22 at 19:47
In your session info you have `plyr` listed before `dplyr`, which is generally bad as I point out in my last comment. If you see that FAQ I link to it has more explanation. You can explicitly use `dplyr::summarize` instead of just `summarize` as a workaround. — Gregor Thomas, Sep 18 '22 at 02:20

score 0 · Answer 1 · answered Sep 15 '22 at 20:16

0

library(dplyr)
cols_to_sum = c("total_cases", "new_cases")

df_weekly = df %>% 
  group_by(location, week) %>% 
  summarize(across(all_of(cols_to_sum), sum))

df_weekly
# # A tibble: 2 × 4
# # Groups:   location [1]
#   location    week    total_cases new_cases
#   <chr>       <chr>         <int>     <int>
# 1 Afghanistan 2020-08          30         5
# 2 Afghanistan 2020-09          20         0

answered Sep 15 '22 at 20:16

Gregor Thomas

136,190
20
167
294

I'm getting the same error... I'm not sure what's happening. df %>% group_by(location, week) %>% summarize(across(all_of(columns_to_sum), sum)) Error in `context_peek()`: ! `across()` must only be used inside dplyr verbs. – therickster Sep 15 '22 at 20:48
Can you edit your `sessionInfo()` into your question? Or just let me know what versions of R and `dplyr` you're using? – Gregor Thomas Sep 15 '22 at 21:39
For some reason my last comment didn't show. Anyways, I fixed it by specifying `dplyr::summarize()` and it now leads to a new error: Error in `h()`: ! Problem with `summarise()` input `..1`. i `..1 = across(all_of(columns_to_sum), sum)`. x invalid 'type' (character) of argument i The error occurred in group 1: location = "Afghanistan", week = "2020-08". (continued below) – therickster Sep 16 '22 at 22:19
So I figured I need to change some columns from "chr" to "double": `change_cols_df = c(26,28,30,31,32,33,34,35,36,37,39,40,41,42,43,45,46,47) > columns_to_change = colnames(df)[change_cols_df] > df = df %>% mutate(across(columns_to_change, as.double)) ` But calling `df[change_cols_df]` still displays each column as `` even though I literally asked it to change it. – therickster Sep 16 '22 at 22:19

Iterate through column names to get different type of functions summarized by week in r dataframe using dplyr

1 Answers1