1

I have the following data frame but in a bigger scale of course:

country year strain num_cases
mex 1996 sp_m014 412
mex 1996 sp_f014 214
mex 1998 sp_m014 150
mex 1998 sp_f014 200
usa 1996 sp_m014 200
usa 1996 sp_f014 180
usa 1997 sp_m014 190
usa 1997 sp_f014 150

I want to get the following result, that is the sum of sp_m014 (male) and sp_f014 (female) for mex and usa individually:

country year strain num_cases
mex 1996 sp 626
mex 1998 sp 350
usa 1996 sp 380
usa 1997 sp 340

In my real data frame I have a lot more age ranges, here I only show the 014 for males and females. But I want to summarize them that way for every age range and gender.

Thanks!

Jvet
  • 103
  • 2
  • 8
  • Yes sorry. That's why I reverted it to the original I think. – Jvet May 06 '21 at 19:42
  • Really sorry about that, my mistake! I got confused and started to change things, which I think I have reverted to the way it was. – Jvet May 06 '21 at 19:44

3 Answers3

3

Grouped by 'country', 'year' summarise to update the 'strain' as 'sp' and get the sum of 'num_cases'

library(dplyr)
df1 %>%
   group_by(country, year) %>%
   summarise(strain = 'sp', num_cases = sum(num_cases), .groups = 'drop')

-output

# A tibble: 4 x 4
#  country  year strain num_cases
#* <chr>   <int> <chr>      <int>
#1 mex      1996 sp           626
#2 mex      1998 sp           350
#3 usa      1996 sp           380
#4 usa      1997 sp           340

data

df1 <- structure(list(country = c("mex", "mex", "mex", "mex", "usa", 
"usa", "usa", "usa"), year = c(1996L, 1996L, 1998L, 1998L, 1996L, 
1996L, 1997L, 1997L), strain = c("sp_m014", "sp_f014", "sp_m014", 
"sp_f014", "sp_m014", "sp_f014", "sp_m014", "sp_f014"), num_cases = c(412L, 
214L, 150L, 200L, 200L, 180L, 190L, 150L)), 
class = "data.frame", row.names = c(NA, 
-8L))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Thanks! this solved my problem. By the way, I can't find inside the "summarize" documentation the .group("drop") argument. Can you point me to the place where I can see how it works? – Jvet May 06 '21 at 17:14
  • @Jvet It is mentioned in the `?summarise`. are you using the current version of `dplyr`. Also, can you check [here](https://stackoverflow.com/questions/62140483/how-to-interpret-dplyr-message-summarise-regrouping-output-by-x-override/62140681#62140681) – akrun May 06 '21 at 17:16
2

Here's an approach with tidyr::extract:

library(tidyr);library(dplyr)
df1 %>%
  extract(strain, into = c("strain","sex","age"), "(\\w+)_([mf])(.*)") %>%
  group_by(country,year,strain) %>% 
  summarise(across(num_cases,sum))
# A tibble: 4 x 4
# Groups:   country, year [4]
  country  year strain num_cases
  <chr>   <int> <chr>      <int>
1 mex      1996 sp           626
2 mex      1998 sp           350
3 usa      1996 sp           380
4 usa      1997 sp           340

Now that you have the strains fully parsed you can easily group by sex or age. Thanks to @akrun for the data.

Ian Campbell
  • 23,484
  • 14
  • 36
  • 57
  • 1
    I think you don't want the question mark in the regular expression. Otherwise the `age` column is empty – Alex May 06 '21 at 16:43
1

Update: To use the age range you can do parse_number

df1 %>% 
  mutate(age_range=parse_number(strain)) %>% 
  group_by(country, year, age_range) %>% 
  summarise(num_cases=sum(num_cases))

Output:

  country  year age_range num_cases
  <chr>   <int>     <dbl>     <int>
1 mex      1996        14       626
2 mex      1998        14       350
3 usa      1996        14       380
4 usa      1997        14       340

First answer: Thanks to akrun for the data:

library(tidyverse)
df1 %>% 
  group_by(country, year, strain) %>% 
  mutate(strain=str_extract(strain, "^.{2}")) %>% 
  summarise(num_cases=sum(num_cases))

Output:

  country  year strain num_cases
  <chr>   <int> <chr>      <int>
1 mex      1996 sp           626
2 mex      1998 sp           350
3 usa      1996 sp           380
4 usa      1997 sp           340
TarJae
  • 72,363
  • 6
  • 19
  • 66
  • Hi thanks! What if instead of sp? I would like to sort the age range. As I mention, the sp_014 is in my data according to gender and age. Instead of the column "strain", we generate one called "age_range" and then we see the same num_cases but for the age_range. 014 is just one example, I have in my real data 014, 1524, 2534, until 65. – Jvet May 06 '21 at 17:28
  • Please see my edit. I did not realized this in my first answer. – TarJae May 06 '21 at 17:35