0

I have a tibble that looks like this:

# A tibble: 5 × 12
  geo_ID age_group sex   rep_date   ref_date   ref_rep_date new_case new_death new_recovery amount_case amount_death amount_recovery
   <dbl> <fct>     <fct> <date>     <date>     <lgl>           <dbl>     <dbl>        <dbl>       <dbl>        <dbl>           <dbl>
1   2000 A15-A34   M     2020-09-24 2020-09-18 TRUE                0        -9            0           1            0               1
2   2000 A35-A59   W     2022-03-30 2022-03-24 TRUE                0        -9           -9           1            0               0
3   2000 A15-A34   W     2021-08-03 2021-07-28 TRUE                0        -9            0           3            0               3
4   2000 A05-A14   W     2021-10-14 2021-10-12 TRUE                0        -9            0           1            0               1
5   2000 A35-A59   M     2020-11-06 2020-11-01 TRUE                0        -9            0           4            0               4

I then used dplyr to do the following:

inf_HH %>% select(ref_date, new_case, starts_with("amount")) %>% 
  filter(new_case != 0 & amount_death == 0 & amount_recovery == 0) %>% 
  select(ref_date, ends_with("case")) %>% arrange(desc(ref_date))

and get this tibble out (with %>% slice_sample(n = 20) appended):

# A tibble: 20 × 3
   ref_date   new_case amount_case
   <date>        <dbl>       <dbl>
 1 2022-04-02       -1          -1
 2 2022-04-07        1          58
 3 2022-04-07       -1          -1
 4 2022-04-12        1           1
 5 2022-04-04        1           2
 6 2022-04-07        1          10
 7 2022-04-12        1           2
 8 2022-04-11        1          23
 9 2022-04-10        1           1
10 2022-04-12       -1          -1
11 2022-04-10        1           1
12 2022-04-03        1           1
13 2022-04-04        1           1
14 2022-04-13        1           8
15 2022-04-07        1          41
16 2022-04-01        1           1
17 2022-04-08        1           5
18 2022-04-04        1           1
19 2022-04-09        1           1
20 2022-04-07        1           1

Without %>% slice_sample(n = 20) it has 183 rows, where ref_date has 19 unique values.

What I am trying to achieve is to sum up all values in amount_case for each unique value in ref_date.

What I have been trying to do is use distinct to make progress:

inf_HH %>% select(ref_date, new_case, starts_with("amount")) %>% 
  filter(new_case != 0 & amount_death == 0 & amount_recovery == 0) %>% 
  select(ref_date, ends_with("case")) %>% arrange(desc(ref_date)) %>% 
  distinct(ref_date, .keep_all = TRUE)

What I get is:

# A tibble: 19 × 3
   ref_date   new_case amount_case
   <date>        <dbl>       <dbl>
 1 2022-04-14        1          10
 2 2022-04-13        1          28
 3 2022-04-12       -1          -2
 4 2022-04-11       -1          -1
 5 2022-04-10       -1          -1
 6 2022-04-09        1           1
 7 2022-04-08        1           6
 8 2022-04-07        1           1
 9 2022-04-06        1           1
10 2022-04-05        1           2
11 2022-04-04        1           1
12 2022-04-03        1           1
13 2022-04-02       -1          -1
14 2022-04-01        1           1
15 2022-03-30        1           1
16 2022-03-28        1           1
17 2022-03-24        1           1
18 2022-03-22       -1          -1
19 2022-03-21        1           1

I would like to extend the dplyr-piping so that I get a tibble like the last one above but with the sums of all the values in amount_case for each unique value in ref_date.

Another way, which I cannot express in R, would be to somehow use group_by() with summarise() but I cannot find a way to use them to group rows AND add the values in amount_case.

Preferably this should happen with tidyverse / dplyr packages, which I am sort of comfortable with. I need to do the same operation on hundreds of other datasets, with similar structure, so it is out of question to address the unique values of ref_date manually.

r2evans
  • 141,215
  • 6
  • 77
  • 149

0 Answers0