1

I have a dataframe which contains >100 columns, some are numeric, some not.
All variables ending with "_f" or "_m" are numeric variables and I would like to sum all the pairs that start with the same pattern but end with "_f" or "_m".

Here is an example of variable names in my dataframe:

xxxxxxxxxxxxx_age1_f
xxxxxxxxxxxxx_age1_m
xxxxxxxxxxxxx_age2_f
xxxxxxxxxxxxx_age2_m
xxxxxxxxxxxxx_age3_f
xxxxxxxxxxxxx_age3_m
yyyyyyyyyy_age1_f
yyyyyyyyyy_age1_m
yyyyyyyyyy_age2_f
yyyyyyyyyy_age2_m
yyyyyyyyyy_age3_f
yyyyyyyyyy_age3_m
yyyyyyyyyy_age4_f
yyyyyyyyyy_age4_m
yyyyyyyyyy_age5_f
yyyyyyyyyy_age5_m
zzzzzzzzzzzzzzzzzzzz_age1_f
zzzzzzzzzzzzzzzzzzzz_age1_m
zzzzzzzzzzzzzzzzzzzz_age2_f
zzzzzzzzzzzzzzzzzzzz_age2_m
zzzzzzzzzzzzzzzzzzzz_age3_f
zzzzzzzzzzzzzzzzzzzz_age3_m
text_var_11
text_var_222
text_var_33333

(I'm abstracting the names here with x, y z to make my question clearer, they are not really named like that)

My first solution would be to sum each pair using dplyr::mutate() like this:

mutate( ... ) %>%
mutate( yyyyyyyyyy_age2 = yyyyyyyyyy_age2_f + yyyyyyyyyy_age2_m) %>%
mutate( yyyyyyyyyy_age3 = yyyyyyyyyy_age3_f + yyyyyyyyyy_age3_m) %>%
mutate( ... ) %>%

This will work, but there must be more intelligent way to do this without repeating this for all variable pairs.

After looking for a solution, the closest I found was this Sum all columns whose names start with a pattern, by group

However the proposed solution doesn't work in my case for 2 reasons:

  • the substr() is not applicable to my problem since the lengths of the variable names change
  • this method assumes I only have variables I want to sum, while in my case I have many other variables that don't end with "_f" or "_m" and don't need to be summed (and cannot be summed since some are text)

I suppose the solution could be modified to apply to my case but I'm not sure how.

EDIT: here is sample data created with dput

structure(list(Groups = c("xx", "xx", "xx"), xxxxx_age0_f = c(8, 
0, 7), xxxxx_age0_m = c(5, 0, 0), xxxxx_age1_f = c(1, 
0, 0), xxxxx_age1_m = c(3, 2, 0), xxxxx_age2_f = c(0, 
0, 2), xxxxx_age2_m = c(0, 1, 0), zzzz_age0_f = c(4, 
2, NA), zzzz_age0_m = c(3, 6, NA), zzzz_age1_f = c(0, 
0, NA), zzzz_age1_m = c(2, 0, NA), zzzz_age2_f = c(4, 
1, NA), zzzz_age2_m = c(3, 1, NA)), row.names = c(NA, -3L
), class = c("tbl_df", "tbl", "data.frame"))

ps: this simplified example has only one categorical variable, while I have hundreds.

stallingOne
  • 3,633
  • 3
  • 41
  • 63
  • I would suggest you to provide a piece of your dataframe as a reproducible example. That would increase your chances of getting help. – PaulS Feb 26 '22 at 13:52
  • 1
    **Minimal** reproducible examples are best. Include 2 or 3 column pairs, one column you don't want to sum, and 3 or 4 rows--that should be plenty to demonstrate solutions. `dput()` is the nicest way to share data as it is copy/pasteable. `dput(your_data[1:3, c("relevant", "columns", "for", "example")])`. – Gregor Thomas Feb 26 '22 at 13:53
  • Use `dput` function to insert a sample of your data here, as suggested by @GregorThomas! – PaulS Feb 26 '22 at 13:58
  • `starts_with()` will be your friend. That said, this is a classic example of why [tidy](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html) data frames are better than untidy data frames. Your layout is untidy because the column names contain information you need to create your summary. With a tidy data frame, solving your problem is not only easier, but the solution is more robust. – Limey Feb 26 '22 at 14:03
  • I have added the dput output, thanks for the suggestion @GregorThomas – stallingOne Feb 26 '22 at 14:11

2 Answers2

1

Updated, with OP's actual example data:

inner_join(
  dat,
  dat %>% 
    select(Groups, ends_with(c("_f", "_m"))) %>% 
    pivot_longer(cols=!Groups) %>% 
    mutate(name = gsub("_.$",replacement = "",name)) %>% 
    group_by(Groups, name) %>% 
    summarize(value=sum(value,na.rm=T)) %>% 
    pivot_wider(id_cols ="Groups", "name"),
  by="Groups"
)

Previous example, prior to OP's actual example.

If your table looks like this:

dat

# A tibble: 2 x 9
  zzzzzzzzzzzzzzzzzzzz_age1_f zzzzzzzzzzzzzzzzzzzz_age1_m zzzzzzzzzzzzzzzz~ zzzzzzzzzzzzzzz~ zzzzzzzzzzzzzzz~ zzzzzzzzzzzzzzz~ text_var_11 text_var_222 text_var_33333
                        <dbl>                       <dbl>             <dbl>            <dbl>            <dbl>            <dbl> <chr>       <chr>        <chr>         
1                      -0.709                       1.26               1.03             1.36           -0.140           -0.595 f           o            x             
2                      -0.202                       0.164             -1.28            -1.48           -0.380            0.874 a           p            m             

Then, you can do this:

dat %>% 
  select(ends_with(c("_f", "_m"))) %>% 
  pivot_longer(cols=everything()) %>% 
  mutate(name = gsub("_.$",replacement = "",name)) %>% 
  group_by(name) %>% 
  summarize(value=sum(value,na.rm=T))

To get this:

  name                       value
  <chr>                      <dbl>
1 zzzzzzzzzzzzzzzzzzzz_age1  0.510
2 zzzzzzzzzzzzzzzzzzzz_age2 -0.371
3 zzzzzzzzzzzzzzzzzzzz_age3 -0.240

Now, I don't know what your desired output structure is, but your mutate attempt above suggests you want to column bind these new aggregate columns. This could be done easily by taking that interim result above, and wrapping it in bind_cols, like this

bind_cols(
  dat,
  dat %>% 
    select(ends_with(c("_f", "_m"))) %>% 
    pivot_longer(cols=everything()) %>% 
    mutate(name = gsub("_.$",replacement = "",name)) %>% 
    group_by(name) %>% 
    summarize(value=sum(value,na.rm=T)) %>% 
    pivot_wider(id_cols ="name")
)
langtang
  • 22,248
  • 1
  • 12
  • 27
  • Bring the `groups` column along for the ride (or join back to the original data to get it), pivot back to wide, and I think this is probably what OP is looking for. – Gregor Thomas Feb 26 '22 at 14:11
  • thanks @GregorThomas, I was just adding that edit - you read my mind.. – langtang Feb 26 '22 at 14:13
  • Error in `pivot_wider_spec()`: `id_expand` must be a single `TRUE` or `FALSE`.` I think you need to remove `, "name"` from pivot_wider, correct ? – stallingOne Feb 26 '22 at 16:17
  • One downside I see is that you selected "Groups" it was the only categorical variable in my example but I actually have hundreds of variables. – stallingOne Feb 26 '22 at 16:29
  • Other issue is that it's taking groups as an id, which it is not. If group has always the same value (for example 'xx', then it's merging all the rows into 1. – stallingOne Feb 26 '22 at 16:32
  • just follow the second approach then.. i just assumed you wanted a grouping variable. If there is no grouping varaiable, the second approach will work fine – langtang Feb 26 '22 at 16:38
  • right, but bind_cols will fix that. will return the original number of rows, plus all original columns, plus the sums, which are constants over all the rows – langtang Feb 26 '22 at 16:54
  • The bind_cols method seems to work! Just one bug: I think you need to remove `id_cols ="name"` from `pivot_wider(id_cols ="name")` – stallingOne Feb 26 '22 at 17:24
0

This does not solve all of your problem but maybe you could try something along the lines of

df %>%
rowwise() # will apply your functions to rows instead of columns

for i in c("age1", "age2", "age3"){

df %>%
z <- mutate({{i}} = sum(c_across(contains({{i}})))) %>%
print(z)
}

df %>% ungroup()
terraviva
  • 11
  • 4
  • 1
    As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Jun 17 '22 at 12:48