How to summarize two different rows with different values to a single row with that sum using dplyr?

Question

I have the following data frame but in a bigger scale of course:

country	year	strain	num_cases
mex	1996	sp_m014	412
mex	1996	sp_f014	214
mex	1998	sp_m014	150
mex	1998	sp_f014	200
usa	1996	sp_m014	200
usa	1996	sp_f014	180
usa	1997	sp_m014	190
usa	1997	sp_f014	150

I want to get the following result, that is the sum of sp_m014 (male) and sp_f014 (female) for mex and usa individually:

country	year	strain	num_cases
mex	1996	sp	626
mex	1998	sp	350
usa	1996	sp	380
usa	1997	sp	340

In my real data frame I have a lot more age ranges, here I only show the 014 for males and females. But I want to summarize them that way for every age range and gender.

Thanks!

Yes sorry. That's why I reverted it to the original I think. — Jvet, May 06 '21 at 19:42
Really sorry about that, my mistake! I got confused and started to change things, which I think I have reverted to the way it was. — Jvet, May 06 '21 at 19:44

score 3 · Accepted Answer · answered May 06 '21 at 16:31

Grouped by 'country', 'year' summarise to update the 'strain' as 'sp' and get the sum of 'num_cases'

library(dplyr)
df1 %>%
   group_by(country, year) %>%
   summarise(strain = 'sp', num_cases = sum(num_cases), .groups = 'drop')

-output

# A tibble: 4 x 4
#  country  year strain num_cases
#* <chr>   <int> <chr>      <int>
#1 mex      1996 sp           626
#2 mex      1998 sp           350
#3 usa      1996 sp           380
#4 usa      1997 sp           340

data

df1 <- structure(list(country = c("mex", "mex", "mex", "mex", "usa", 
"usa", "usa", "usa"), year = c(1996L, 1996L, 1998L, 1998L, 1996L, 
1996L, 1997L, 1997L), strain = c("sp_m014", "sp_f014", "sp_m014", 
"sp_f014", "sp_m014", "sp_f014", "sp_m014", "sp_f014"), num_cases = c(412L, 
214L, 150L, 200L, 200L, 180L, 190L, 150L)), 
class = "data.frame", row.names = c(NA, 
-8L))

Thanks! this solved my problem. By the way, I can't find inside the "summarize" documentation the .group("drop") argument. Can you point me to the place where I can see how it works? — Jvet, May 06 '21 at 17:14
@Jvet It is mentioned in the `?summarise`. are you using the current version of `dplyr`. Also, can you check [here](https://stackoverflow.com/questions/62140483/how-to-interpret-dplyr-message-summarise-regrouping-output-by-x-override/62140681#62140681) — akrun, May 06 '21 at 17:16

Ian Campbell · Answer 2 · 2021-05-06T16:44:35.603

2

Here's an approach with tidyr::extract:

library(tidyr);library(dplyr)
df1 %>%
  extract(strain, into = c("strain","sex","age"), "(\\w+)_([mf])(.*)") %>%
  group_by(country,year,strain) %>% 
  summarise(across(num_cases,sum))
# A tibble: 4 x 4
# Groups:   country, year [4]
  country  year strain num_cases
  <chr>   <int> <chr>      <int>
1 mex      1996 sp           626
2 mex      1998 sp           350
3 usa      1996 sp           380
4 usa      1997 sp           340

Now that you have the strains fully parsed you can easily group by sex or age. Thanks to @akrun for the data.

edited May 06 '21 at 16:44

answered May 06 '21 at 16:38

Ian Campbell

23,484
14
36
57

1

I think you don't want the question mark in the regular expression. Otherwise the `age` column is empty – Alex May 06 '21 at 16:43

TarJae · Answer 3 · 2021-05-06T17:34:50.327

1

Update: To use the age range you can do parse_number

df1 %>% 
  mutate(age_range=parse_number(strain)) %>% 
  group_by(country, year, age_range) %>% 
  summarise(num_cases=sum(num_cases))

Output:

  country  year age_range num_cases
  <chr>   <int>     <dbl>     <int>
1 mex      1996        14       626
2 mex      1998        14       350
3 usa      1996        14       380
4 usa      1997        14       340

First answer: Thanks to akrun for the data:

library(tidyverse)
df1 %>% 
  group_by(country, year, strain) %>% 
  mutate(strain=str_extract(strain, "^.{2}")) %>% 
  summarise(num_cases=sum(num_cases))

Output:

  country  year strain num_cases
  <chr>   <int> <chr>      <int>
1 mex      1996 sp           626
2 mex      1998 sp           350
3 usa      1996 sp           380
4 usa      1997 sp           340

edited May 06 '21 at 17:34

answered May 06 '21 at 16:45

TarJae

72,363
6
19
66

Hi thanks! What if instead of sp? I would like to sort the age range. As I mention, the sp_014 is in my data according to gender and age. Instead of the column "strain", we generate one called "age_range" and then we see the same num_cases but for the age_range. 014 is just one example, I have in my real data 014, 1524, 2534, until 65. – Jvet May 06 '21 at 17:28
Please see my edit. I did not realized this in my first answer. – TarJae May 06 '21 at 17:35

How to summarize two different rows with different values to a single row with that sum using dplyr?

3 Answers3

data