2

I have the following dataframe

df <- data.frame(n_val1 =c(1,2,1,1), n_val2 = c(2,3,5,1), p_val1 = c(2,4,5,1), p_val2 = c(2,3,1,1), n_sum_val1 = c(5,4,2,4), n_sum_val2 = c(6,7,8,9), p_sum_val1 = c(5, 9, 1, 3), p_sum_val2 = c(6,8,2,4) )

I want calculate corresponding percentage (e.g., n_val1/n_sum_val1, n_val2/n_sum_val2), same applies to (p_val1/p_sum_val1, p_val2/p_sum_val2). To do this, I used the following code for one pair:

map2(c(paste0('n_val', 1:2)), c(paste0('n_sum_val', 1:2)), ~ df[[.x]] / df[[.y]] ) %>% set_names(c(paste0('n_pct', 1:2))) %>% bind_cols(df, .) 

How can I calculate percentages for both pairs of columns p_ and n_ ?

Vendetta
  • 294
  • 4
  • 11

2 Answers2

1

You have a lot of what I might call metadata encoded in those column names that you want to use to match them. Therefore I suggest using tidyr::pivot_longer() to get that metadata out of column names and into the actual data.frame and then parse it out so you can match the each record with the sum you want to divide by.

I used tidyr::unite() to restore your original column names and tidyr::pivot_wider() to put it back into 'wide' format.

Note on the first separate() call I used a regex that splits on the last occurrence of _ since some of them have the word sum and some don't.

library(tidyverse)

df <- data.frame(n_val1 =c(1,2,1,1), n_val2 = c(2,3,5,1), p_val1 = c(2,4,5,1), p_val2 = c(2,3,1,1), n_sum_val1 = c(5,4,2,4), n_sum_val2 = c(6,7,8,9), p_sum_val1 = c(5, 9, 1, 3), p_sum_val2 = c(6,8,2,4) )

df %>% 
  mutate(id = row_number()) %>% 
  pivot_longer(-id) %>% 
  separate(name, into = c("a", "c"), sep = "_(?!(?s:.*)_)") %>% 
  separate(a, into = c("a", "b"), sep = "_") %>% 
  separate(c, into = c("c", "d"), sep = 3) %>% 
  mutate(b = case_when(
    is.na(b) ~ "numerator", 
    T ~ "denominator")) %>% 
  pivot_wider(names_from = b, values_from = value) %>% 
  mutate(percent = numerator/denominator) %>% 
  pivot_longer(numerator:denominator, names_to = "b") %>% 
  mutate(b = case_when(b == "denominator" ~ "sum")) %>% 
  unite("name", c(a, b, c, d), na.rm = T) %>% 
  pivot_wider(values_from = c(value, percent), names_from = name, names_glue = "{name}_{.value}")
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 16 rows [1, 2, 3,
#> 4, 9, 10, 11, 12, 17, 18, 19, 20, 25, 26, 27, 28].
#> # A tibble: 4 x 17
#>      id n_val_1_value n_sum_val_1_value n_val_2_value n_sum_val_2_value
#>   <int>         <dbl>             <dbl>         <dbl>             <dbl>
#> 1     1             1                 5             2                 6
#> 2     2             2                 4             3                 7
#> 3     3             1                 2             5                 8
#> 4     4             1                 4             1                 9
#> # ... with 12 more variables: p_val_1_value <dbl>, p_sum_val_1_value <dbl>,
#> #   p_val_2_value <dbl>, p_sum_val_2_value <dbl>, n_val_1_percent <dbl>,
#> #   n_sum_val_1_percent <dbl>, n_val_2_percent <dbl>,
#> #   n_sum_val_2_percent <dbl>, p_val_1_percent <dbl>,
#> #   p_sum_val_1_percent <dbl>, p_val_2_percent <dbl>, p_sum_val_2_percent <dbl>

Created on 2022-03-10 by the reprex package (v2.0.1)

Dan Adams
  • 4,971
  • 9
  • 28
  • I appreciate if you could please output the data in same format as my map2 command above. I tried uniting and pivoting, `tidyr::unite("z", a:d, remove = FALSE) %>% pivot_wider(names_from = z, values_from = percent)`, but this does not seem to working. – Vendetta Mar 10 '22 at 07:15
  • See updated answer. – Dan Adams Mar 10 '22 at 07:44
1

Here is an alternative approach using tidyverse with across. You can select the columns based on a character pattern (e.g., starts with "n_val" and "n_sum_val"). Your new column names can be indicated based on the pattern using the .names argument.

library(tidyverse)

df %>%
  mutate(across(starts_with("n_val"), .names = "pct_{col}") / across(starts_with("n_sum_val")),
         across(starts_with("p_val"), .names = "pct_{col}") / across(starts_with("p_sum_val")))

Output

  n_val1 n_val2 p_val1 p_val2 n_sum_val1 n_sum_val2 p_sum_val1 p_sum_val2 pct_n_val1 pct_n_val2 pct_p_val1 pct_p_val2
1      1      2      2      2          5          6          5          6       0.20  0.3333333  0.4000000  0.3333333
2      2      3      4      3          4          7          9          8       0.50  0.4285714  0.4444444  0.3750000
3      1      5      5      1          2          8          1          2       0.50  0.6250000  5.0000000  0.5000000
4      1      1      1      1          4          9          3          4       0.25  0.1111111  0.3333333  0.2500000
Ben
  • 28,684
  • 5
  • 23
  • 45