3

I have many columns that have same names that always start with the same string, either n_ for the number of students, score_ for the percent of students who passed, and loc_ for the room number.

In this, I want to multiple the n_ columns with their respective score_ columns (so n_math * score_math, n_sci * score_sci, etc.) and create new columns called n_*_success for the number of students who passed the class.

If I had just a few columns like in this sample dataset, I would do something like this for each column:

mutate(n_sci_success = n_sci * score_sci)

But I have many columns and I'd like to write some expression that will match column names.

I think I have to use regex and across (like across(starts_with("n_)), but I just can't figure it out. Any help would be much appreciated!

Here's a sample dataset:

library(tidyverse)

test <- tibble(id = c(1:4),
               n_sci = c(10, 20, 30, 40),
               score_sci = c(1, .9, .75, .7),
               loc_sci = c(1, 2, 3, 4),
               n_math = c(100, 50, 40, 30),
               score_math = c(.5, .6, .7, .8),
               loc_math = c(4, 3, 2, 1),
               n_hist = c(10, 50, 30, 20),
               score_hist = c(.5, .5, .9, .9),
               loc_hist = c(2, 1, 4, 3))

J.Sabree
  • 2,280
  • 19
  • 48

3 Answers3

2

Here's one way using across and new pick function from dplyr 1.1.0

library(dplyr)

out <- test %>%
  mutate(across(starts_with('n_'), .names = 'res_{col}') * 
           pick(starts_with('score_')) * pick(starts_with('loc_')))

out %>% select(starts_with('res'))

#  res_n_sci res_n_math res_n_hist
#      <dbl>      <dbl>      <dbl>
#1      10          200         10
#2      36           90         25
#3      67.5         56        108
#4     112           24         54

This should also work if you replace all pick with across. pick is useful for selecting columns, across is useful when you need to apply a function to the columns selected.

I am using across in the 1st case (with starts_with('n_')) is because I want to give unique names to the new columns using .names which is not present in pick.

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • thanks! Quick question--how does pick() know to match the columns together? For instance, when I look at it, my (limited) understanding would suggest that R would think to multiple ALL n_ rows by ALL score_ rows and ALL loc_ rows, but that's clearly not the case. Is it because of the col argument? – J.Sabree Feb 12 '23 at 04:39
  • 1
    `pick` does not match the columns, it just selects them. It requires the columns to have specific order either all `sci` columns should be together (like in the example) or all `n`, `score`, `loc` should be together. – Ronak Shah Feb 12 '23 at 04:56
2

pick() is very nice, thanks for sharing. Here is way using reduce from purrr package:

We first use split.default to get a list, then apply reduce via map_dfr:

library(purrr)
library(stringr)
test %>%
  split.default(str_remove(names(.), ".*_")) %>% 
  map_dfr(reduce, `*`) 
# A tibble: 4 × 4
   hist    id  math   sci
  <dbl> <int> <dbl> <dbl>
1    10     1   200  10  
2    25     2    90  36  
3   108     3    56  67.5
4    54     4    24 112  
TarJae
  • 72,363
  • 6
  • 19
  • 66
2

We can use {dplyover} to solve this kind of problems. Disclaimer: I'm the maintainer and the package is not on CRAN.

We have two options:

The easiest way is to use dplyover::across2(). Below I use dplyr::transmute() to only show the newly created columns, but we can use mutate() instead to add the new columns to our data.frame.

across2() lets you specify two sets of columns to loop over. Here we choose all columns that starts_with("n_") and all columns that starts_with("score_"). We can then use .x (for the former) and .y (for the latter) in the .fns argument. In the .names argument we can specify how our new names should look like. We take the name of the first column {xcol} and add _success to it.

library(dplyover) # https://timteafan.github.io/dplyover/

test %>% 
  transmute(
    across2(starts_with("n_"),
           starts_with("score_"),
           ~ .x * .y,
           .names = "{xcol}_success")
  )
#> # A tibble: 4 × 3
#>   n_sci_success n_math_success n_hist_success
#>           <dbl>          <dbl>          <dbl>
#> 1          10               50              5
#> 2          18               30             25
#> 3          22.5             28             27
#> 4          28               24             18

While this approach is easy and straightforward there is one problem: it assumes that the columns are in the correct order. This is also an assumption of the other two answers. If we have a large data.frame and are not sure if really all columns are in the correct order, dplyover::over() is the better and programmatically safe option.

Here we loop over a string and use this to construct the variable names. Within over() we use cut_names("^.*_") to get the stems of the variable names, in our example c("sci", "math", "hist"). Then in the function in .fns we construct the variable names by wrapping a string inside .() (to evaluate it as a variable name). Within the string we can use {x} to access the string of the current iteration. This approach will always combine n_sci with score_sci even if the columns are not ordered correctly. Finally, here too we can create nice names on the fly in the .names argument.

test %>% 
  transmute(
    over(cut_names("^.*_"), # <- gets us c("sci", "math", "hist")
         ~ .("n_{.x}") * .("score_{.x}"),
         .names = "n_{x}_success"
           )
  )
#> # A tibble: 4 × 3
#>   n_sci_success n_math_success n_hist_success
#>           <dbl>          <dbl>          <dbl>
#> 1          10               50              5
#> 2          18               30             25
#> 3          22.5             28             27
#> 4          28               24             18

Data from OP

library(tidyverse)

test <- tibble(id = c(1:4),
               n_sci = c(10, 20, 30, 40),
               score_sci = c(1, .9, .75, .7),
               loc_sci = c(1, 2, 3, 4),
               n_math = c(100, 50, 40, 30),
               score_math = c(.5, .6, .7, .8),
               loc_math = c(4, 3, 2, 1),
               n_hist = c(10, 50, 30, 20),
               score_hist = c(.5, .5, .9, .9),
               loc_hist = c(2, 1, 4, 3))

Created on 2023-02-12 with reprex v2.0.2

TimTeaFan
  • 17,549
  • 4
  • 18
  • 39