I have a dataframe which contains >100 columns, some are numeric, some not.
All variables ending with "_f" or "_m" are numeric variables and I would like to sum all the pairs that start with the same pattern but end with "_f" or "_m".
Here is an example of variable names in my dataframe:
xxxxxxxxxxxxx_age1_f
xxxxxxxxxxxxx_age1_m
xxxxxxxxxxxxx_age2_f
xxxxxxxxxxxxx_age2_m
xxxxxxxxxxxxx_age3_f
xxxxxxxxxxxxx_age3_m
yyyyyyyyyy_age1_f
yyyyyyyyyy_age1_m
yyyyyyyyyy_age2_f
yyyyyyyyyy_age2_m
yyyyyyyyyy_age3_f
yyyyyyyyyy_age3_m
yyyyyyyyyy_age4_f
yyyyyyyyyy_age4_m
yyyyyyyyyy_age5_f
yyyyyyyyyy_age5_m
zzzzzzzzzzzzzzzzzzzz_age1_f
zzzzzzzzzzzzzzzzzzzz_age1_m
zzzzzzzzzzzzzzzzzzzz_age2_f
zzzzzzzzzzzzzzzzzzzz_age2_m
zzzzzzzzzzzzzzzzzzzz_age3_f
zzzzzzzzzzzzzzzzzzzz_age3_m
text_var_11
text_var_222
text_var_33333
(I'm abstracting the names here with x, y z to make my question clearer, they are not really named like that)
My first solution would be to sum each pair using dplyr::mutate()
like this:
mutate( ... ) %>%
mutate( yyyyyyyyyy_age2 = yyyyyyyyyy_age2_f + yyyyyyyyyy_age2_m) %>%
mutate( yyyyyyyyyy_age3 = yyyyyyyyyy_age3_f + yyyyyyyyyy_age3_m) %>%
mutate( ... ) %>%
This will work, but there must be more intelligent way to do this without repeating this for all variable pairs.
After looking for a solution, the closest I found was this Sum all columns whose names start with a pattern, by group
However the proposed solution doesn't work in my case for 2 reasons:
- the
substr()
is not applicable to my problem since the lengths of the variable names change - this method assumes I only have variables I want to sum, while in my case I have many other variables that don't end with "_f" or "_m" and don't need to be summed (and cannot be summed since some are text)
I suppose the solution could be modified to apply to my case but I'm not sure how.
EDIT: here is sample data created with dput
structure(list(Groups = c("xx", "xx", "xx"), xxxxx_age0_f = c(8,
0, 7), xxxxx_age0_m = c(5, 0, 0), xxxxx_age1_f = c(1,
0, 0), xxxxx_age1_m = c(3, 2, 0), xxxxx_age2_f = c(0,
0, 2), xxxxx_age2_m = c(0, 1, 0), zzzz_age0_f = c(4,
2, NA), zzzz_age0_m = c(3, 6, NA), zzzz_age1_f = c(0,
0, NA), zzzz_age1_m = c(2, 0, NA), zzzz_age2_f = c(4,
1, NA), zzzz_age2_m = c(3, 1, NA)), row.names = c(NA, -3L
), class = c("tbl_df", "tbl", "data.frame"))
ps: this simplified example has only one categorical variable, while I have hundreds.