3

Suppose I want to make a new variable based on conditions within multiple other variables, and the condition is the same across each of these variables. I know I could use case_when(), but I'm curious to see if this can be simplified if my conditional phrase is the same for each conditional variable. I also want to know if this can be easily replicated to create multiple variables.

Example: A teacher has 3 students who have received grades for 3 tests and 3 quizzes. He wants to create a variable that says whether or not a student ever had a score of <70 on any test or quiz. So he will create two new variables as so:


ID <- c("Dave", "Joe", "Steve")
exam1 <- c(80, 100, 90)
exam2 <- c(30, 90, 88)
exam3 <- c(90, 65, 95)
quiz1 <- c(90, 90, 20)
quiz2 <- c(33, 100, 100)
quiz3 <- c(90, 90, 50)

data <- tibble(ID, exam1, exam2, exam3, quiz1, quiz2, quiz3)

data <- data %>% 
  mutate(
    fail_exam = case_when(
      exam1 < 70 ~ 1,
      exam2 < 70 ~ 1,
      exam3 < 70 ~ 1,
      T ~ 0
    ),
    fail_quiz = case_when(
      quiz1 < 70 ~ 1,
      quiz2 < 70 ~ 1,
      quiz3 < 70 ~ 1,
      T ~ 0
    )
  )

He ends up with the following output with his two new variables:

# A tibble: 3 × 9
  ID    exam1 exam2 exam3 quiz1 quiz2 quiz3 fail_exam fail_quiz
  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl>     <dbl>
1 Dave     80    30    90    90    33    90         1         1
2 Joe     100    90    65    90   100    90         1         0
3 Steve    90    88    95    20   100    50         0         1

Now for the sake of this example, suppose you have 100 examination categories (e.g., mid-term, final exam, homework, etc.) for which students received grades, and you want create a new variable for each one of them indicating whether or not they ever had a failing score on it. One could iteratively go through each examination category as I did above with exam and quiz using case_when(), but I'd like to know if there is a simpler way to apply a single condition (i.e., if numeric score <70) to a list of examination categories (example: c("exam", "quiz", "homework", "midterm") that follow the numbering convention I have above in order to create unique output variables such as "fail_exam" and "fail_quiz" for each one of them.

This isn't mission critical, but looking to simplify things a bit.

Thx, C

Konrad Rudolph
  • 530,221
  • 131
  • 937
  • 1,214
Casey
  • 125
  • 6
  • 1
    Make your data [tidy](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html) by using `pivot_longer()` to create a tibble with columns (say) `ID`, `TestType`, `TestIndex` and `Score`. `group_by(ID, TestType)`. An individual `ID` fails the `TestType` if `min(Score) < 70`. – Limey Nov 03 '22 at 13:23

2 Answers2

5

You can use dplyr::if_any() to test multiple variables against a predicate function:

library(dplyr)

data %>% 
  mutate(
    fail_exam = as.numeric(if_any(exam1:exam3, ~ .x < 70)),
    fail_quiz = as.numeric(if_any(quiz1:quiz3, ~ .x < 70))
  )
# A tibble: 3 × 9
  ID    exam1 exam2 exam3 quiz1 quiz2 quiz3 fail_exam fail_quiz
  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl>     <dbl>
1 Dave     80    30    90    90    33    90         1         1
2 Joe     100    90    65    90   100    90         1         0
3 Steve    90    88    95    20   100    50         0         1

PS - also see dplyr::if_all().

Edit: A solution to do the same across an arbitrary number of "quiz" / "exam" categories. This creates a separate summary table of failures by ID and type, which you can then merge back into your original dataframe.

library(dplyr)
library(tidyr)

failures <- data %>% 
  pivot_longer(
    !ID, 
    names_to = c("type", "number"),
    names_pattern = "^(\\w+)(\\d+)$"
  ) %>% 
  group_by(ID, type) %>%
  summarize(
    fail = as.numeric(any(value < 70)),
    .groups = "drop"
  ) %>%
  ungroup() %>%
  pivot_wider(
    names_from = type,
    names_glue = "fail_{type}",
    values_from = fail
  )

data %>% 
  left_join(failures)
zephryl
  • 14,633
  • 3
  • 11
  • 30
1

Here is a generalized way using map_dfc and if_any.

library(dplyr)
cols <- c("exam", "quiz")
data %>% 
  mutate(map_dfc(cols, ~ transmute(data, "fail_{.x}" := +if_any(starts_with(.x), `<`, 70))))

output

# A tibble: 3 × 9
  ID    exam1 exam2 exam3 quiz1 quiz2 quiz3 fail_exam fail_quiz
  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>     <int>     <int>
1 Dave     80    30    90    90    33    90         1         1
2 Joe     100    90    65    90   100    90         1         0
3 Steve    90    88    95    20   100    50         0         1
Maël
  • 45,206
  • 3
  • 29
  • 67
  • Yes this did exactly what I wanted. I have a bit to learn about this strategy. Thanks. – Casey Nov 03 '22 at 16:06
  • Maël, would it be possible to walk me through that line of code? Specifically: 1. What does ":=" do in this code? 2. Why does "+" go before if_any() 3. Why does the "<" have to go between ``? Thank you! – Casey Nov 03 '22 at 18:02
  • Sure. `:=` is a way to create variables using `glue` syntax. Check [here](https://stackoverflow.com/questions/26003574/use-dynamic-name-for-new-column-variable-in-dplyr) for more examples. `+` is used here to coerce the logical vector created by `if_any` into a integer vector of 1 (TRUE) and 0 (FALSE). `<` has to go between backticks because it is a special function (check [here](https://stackoverflow.com/questions/36220823/what-do-backticks-do-in-r#:~:text=A%20pair%20of%20backticks%20is,non%2Dsyntactic%20combinations%20like%20c%20a%20t%20.) for more) – Maël Nov 03 '22 at 18:26