0

I am trying to iterate through global health epidemic data on a database which consists of daily cases, cumulative cases, daily deaths, and cumulative deaths (as well as some other covariables which aren't really relevant here). The table is structured as follows: For each country (with country name listed, region, ID) and each date (though not all dates are displayed for all countries*) the daily/cumulative cases/deaths/etc. are listed.

The data looks something like this:

# A tibble: 40 x 7
   iso_code continent location    date       total_cases new_cases week   
   <chr>    <chr>     <chr>       <date>           <dbl>     <dbl> <chr>  
 1 AFG      Asia      Afghanistan 2020-02-24           5         5 2020-08
 2 AFG      Asia      Afghanistan 2020-02-25           5         0 2020-08
 3 AFG      Asia      Afghanistan 2020-02-26           5         0 2020-08
 4 AFG      Asia      Afghanistan 2020-02-27           5         0 2020-08
 5 AFG      Asia      Afghanistan 2020-02-28           5         0 2020-08
 6 AFG      Asia      Afghanistan 2020-02-29           5         0 2020-08
 7 AFG      Asia      Afghanistan 2020-03-01           5         0 2020-09
 8 AFG      Asia      Afghanistan 2020-03-02           5         0 2020-09
 9 AFG      Asia      Afghanistan 2020-03-03           5         0 2020-09
10 AFG      Asia      Afghanistan 2020-03-04           5         0 2020-09
# ... with 30 more rows

I need to summarize the daily data into weekly data. Of course, this is no problem for one column: using methods described here I should be able to aggregate the data for each week, for each country as follows~

library(dplyr)
sumByColumn <- function(df, colName) {
# the method for daily (cases/deaths)/(cases/deaths) smoothed
  df %>%
    group_by(location, week) %>%
    summarize(colName = sum(!! sym(colName)))
}
idByColumn <- function(df, colName) {
# the method for cumulative (cases/deaths)
  df %>%
    group_by(location, week) %>%
    summarize(colName = identity(!! sym(colName)))
}

(It should be noted that, obviously, daily case/death data will be summarized, whereas cumulative case/death data will be simply the identity function as given. These columns, in the list of column names of df, are denoted as id_cols.)

However, when I try to run the sumByColumn()/idByColumn() loop along the entire dataframe df, I run into this error:

for (col in 1:ncol(df)) {
  colName = colnames(df)[col]
  if (col%in%id_cols) {
    df_weekly = idByColumn(df_weekly,colName)
  } else {
    df_weekly = sumByColumn(df_weekly,colName)
  }
}

I get:

Error in !sym(colName) : invalid argument type

Note: I have computed the frequency by which the number of times each country appears in the dataframe, which corresponds to the number of days the disease was tracked. Is there a way to account for this, e.g. when I go through the weeks, if there is no data for that week, or an uneven number of countries per week give data, to ignore it and not return NA?

916
916
910
892
884
899
971
938
899
946

Edit:

R Session Info is:

R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_(Country).1252  LC_CTYPE=English_(Country).1252    LC_MONETARY=English_(Country).1252
[4] LC_NUMERIC=C                    LC_TIME=English_(Country).1252    
system code page: 65001

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] nnet_7.3-17           plyr_1.8.6            car_3.0-12            carData_3.0-5        
 [5] nlme_3.1-153          lubridate_1.8.0       gridExtra_2.3         ExcelFunctionsR_0.1.4
 [9] forcats_0.5.1         stringr_1.4.0         purrr_0.3.4           readr_2.1.2          
[13] tidyr_1.2.0           tibble_3.1.6          ggplot2_3.3.5         tidyverse_1.3.1      
[17] dplyr_1.0.7           readxl_1.3.1          poisson_1.0          

loaded via a namespace (and not attached):
 [1] tseries_0.10-49   httr_1.4.2        jsonlite_1.7.3    splines_4.1.2     modelr_0.1.8     
 [6] assertthat_0.2.1  TTR_0.24.3        sp_1.4-6          roperators_1.2.0  cellranger_1.1.0 
[11] pillar_1.7.0      backports_1.4.1   lattice_0.20-45   glue_1.6.1        quadprog_1.5-8   
[16] digest_0.6.29     rvest_1.0.2       colorspace_2.0-2  Matrix_1.3-4      timeDate_3043.102
[21] pkgconfig_2.0.3   broom_0.7.12      haven_2.4.3       scales_1.1.1      tzdb_0.2.0       
[26] mgcv_1.8-38       generics_0.1.2    farver_2.1.0      ellipsis_0.3.2    withr_2.5.0      
[31] urca_1.3-0        cli_3.1.1         quantmod_0.4.18   magrittr_2.0.2    crayon_1.5.0     
[36] forecast_8.16     fs_1.5.2          fansi_1.0.2       xts_0.12.1        xml2_1.3.3       
[41] tools_4.1.2       hms_1.1.1         lifecycle_1.0.1   munsell_0.5.0     reprex_2.0.1     
[46] compiler_4.1.2    rlang_1.0.1       grid_4.1.2        rstudioapi_0.13   INLA_21.11.22    
[51] labeling_0.4.2    gtable_0.3.0      fracdiff_1.5-1    abind_1.4-5       DBI_1.1.2        
[56] curl_4.3.2        R6_2.5.1          zoo_1.8-9         utf8_1.2.2        stringi_1.7.6    
[61] parallel_4.1.2    Rcpp_1.0.8        vctrs_0.3.8       dbplyr_2.1.1      tidyselect_1.1.2 
[66] lmtest_0.9-39
therickster
  • 111
  • 6
  • You can use `across()` without any helper functions or loops. For example, `df_weekly = df %>% group_by(location, week) %>% summarize(across(all_of(columns_to_sum), sum))` where `columns_to_sum` is the names of the columns you want to sum. I'm not really sure what you're trying to accomplish with the `summarize(identity())` for the ID columns... if you're not aggregating it just seems like a long way to write `df %>% select(location, week, all_of(id_cols))`. – Gregor Thomas Sep 15 '22 at 02:11
  • Your attempt is not working because `!!` is used when passing unquoted symbols, e.g., it would work if you did `sumByColumn(df_weekly, total_cases)`. You are not doing, you are passing a string column name. For that, we need to use `.data[[colName]]` inside the function. See Ben Bolker's answer [at the related FAQ](https://stackoverflow.com/a/56830842/903061) or the [Programming with dplyr vignette](https://dplyr.tidyverse.org/articles/programming.html) (see the section *"When you have an env-variable that is a character vector..."*) – Gregor Thomas Sep 15 '22 at 02:18
  • (Though also your `for` loop won't because every calculation overwrites the same `df_weekly` object. Only the results from the last iteration will be available.) – Gregor Thomas Sep 15 '22 at 02:20
  • If you need a more explicit answer, please make your example a bit more reproducible. Your sample data doesn't have a `week` column which your code uses, and it seems to be all a single location in a single week, so it doesn't illustrate the problem very well. 4 rows total in 2 different weeks would be better. And if you want to try to address the **Note** make sure there is sample data to illustrate that part of the problem too. And it would be nice if you'd provide the definition of `id_cols` in code, not just a description. Showing the desired results for the sample input is always nice. – Gregor Thomas Sep 15 '22 at 02:23
  • I found Gregor's approach the most useful! But unfortunately I am getting this error after passing his function: ``` Error in `context_peek()`: ! `across()` must only be used inside dplyr verbs. ``` – therickster Sep 15 '22 at 19:41
  • As I said, if you need a more explicit answer, please make your data more reproducible. Also make sure you didn't load `plyr` **after** `dplyr` and ignored the warning about the plyr version of summarize masking the `dplyr` version [as in this FAQ](https://stackoverflow.com/q/26106146/903061). – Gregor Thomas Sep 15 '22 at 19:47
  • hmm I haven't gotten that. I will edit my code – therickster Sep 15 '22 at 19:55
  • ok i fixed it. you should be able to see the edit now – therickster Sep 15 '22 at 19:59
  • In your session info you have `plyr` listed before `dplyr`, which is generally bad as I point out in my last comment. If you see that FAQ I link to it has more explanation. You can explicitly use `dplyr::summarize` instead of just `summarize` as a workaround. – Gregor Thomas Sep 18 '22 at 02:20

1 Answers1

0
library(dplyr)
cols_to_sum = c("total_cases", "new_cases")

df_weekly = df %>% 
  group_by(location, week) %>% 
  summarize(across(all_of(cols_to_sum), sum))

df_weekly
# # A tibble: 2 × 4
# # Groups:   location [1]
#   location    week    total_cases new_cases
#   <chr>       <chr>         <int>     <int>
# 1 Afghanistan 2020-08          30         5
# 2 Afghanistan 2020-09          20         0
Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
  • I'm getting the same error... I'm not sure what's happening. df %>% group_by(location, week) %>% summarize(across(all_of(columns_to_sum), sum)) Error in `context_peek()`: ! `across()` must only be used inside dplyr verbs. – therickster Sep 15 '22 at 20:48
  • Can you edit your `sessionInfo()` into your question? Or just let me know what versions of R and `dplyr` you're using? – Gregor Thomas Sep 15 '22 at 21:39
  • For some reason my last comment didn't show. Anyways, I fixed it by specifying `dplyr::summarize()` and it now leads to a new error: Error in `h()`: ! Problem with `summarise()` input `..1`. i `..1 = across(all_of(columns_to_sum), sum)`. x invalid 'type' (character) of argument i The error occurred in group 1: location = "Afghanistan", week = "2020-08". (continued below) – therickster Sep 16 '22 at 22:19
  • So I figured I need to change some columns from "chr" to "double": `change_cols_df = c(26,28,30,31,32,33,34,35,36,37,39,40,41,42,43,45,46,47) > columns_to_change = colnames(df)[change_cols_df] > df = df %>% mutate(across(columns_to_change, as.double)) ` But calling `df[change_cols_df]` still displays each column as `` even though I literally asked it to change it. – therickster Sep 16 '22 at 22:19