9

Is there a way to dynamically/programmatically generate case_when conditions in dplyr with different column names and/or different numbers of conditions? I have an interactive script that I'm trying to convert into a function. There's a lot of repeated code in the case_when statements and I'm wondering if it can be automated somehow without my needing to write everything from scratch again and again.

Here's a dummy dataset:

test_df = tibble(low_A=c(5, 15, NA),
                 low_TOT=c(NA, 10, NA),
                 low_B=c(20, 25, 30),
                 high_A=c(NA, NA, 10),
                 high_TOT=c(NA, 40, NA),
                 high_B=c(60, 20, NA))

expected_df = tibble(low_A=c(5, 15, NA),
                     low_TOT=c(NA, 10, NA),
                     low_B=c(20, 25, 30),
                     ans_low=c(5, 10, 30),
                     high_A=c(NA, NA, 10),
                     high_TOT=c(NA, 40, NA),
                     high_B=c(60, 20, NA),
                     ans_high=c(60, 40, 10))

> expected_df
# A tibble: 3 x 8
  low_A low_TOT low_B ans_low high_A high_TOT high_B ans_high
  <dbl>   <dbl> <dbl>   <dbl>  <dbl>    <dbl>  <dbl>    <dbl>
1     5      NA    20       5     NA       NA     60       60
2    15      10    25      10     NA       40     20       40
3    NA      NA    30      30     10       NA     NA       10

The logic I want is that if the ._TOT column has a value, use that. If not, then try column ._A, and if not, then column ._B. Note that I intentionally didn't put ._TOT as the first column for a group. I could just use coalesce() in that case, but I want a general solution irrespective of column order.

Of course, all of this is easy to do with a couple of case_when statements. My issues are that:

  1. I'm trying to make a general function and so don't want interactive/tidy evaluation.
  2. I have a whole bunch of columns like this. All ending with one of _TOT, _A, _B but with different prefixes (e.g., low_TOT, low_A, low_B, high_TOT, high_A, high_B,..... and I don't want to rewrite a bunch of case_when functions again and again.

What I have right now looks like this (where I'm writing a case_when for each prefix):

def my_function = function(df) { 
    df %>% mutate(
          # If a total low doesn't exist, use A (if exists) or B (if exists)
          "ans_low" := case_when(
            !is.na(.data[["low_TOT"]]) ~ .data[["low_TOT"]],
            !is.na(.data[["low_A"]]) ~ .data[["low_A"]],
            !is.na(.data[["low_B"]]) ~ .data[["low_B"]],
          ),

          # If a total high doesn't exist, use A (if exists) or B (if exists)
          "ans_high" := case_when(
            !is.na(.data[["high_TOT"]]) ~ .data[["high_TOT"]],
            !is.na(.data[["high_A"]]) ~ .data[["high_R"]],
            !is.na(.data[["high_B"]]) ~ .data[["high_B"]],
              
         # Plus a whole bunch of similar case_when functions...
}

And what I'd like is to ideally get a way to dynamically generate case_when functions with different conditions so that I'm not writing a new case_when each time by exploiting the fact that:

  1. All the three conditions have the same general form, and the same structure for the variable names, but with a different prefix (high_, low_, etc.).
  2. They have the same formula of the form !is.na( .data[[ . ]]) ~ .data[[ . ]], where the dot(.) is the dynamically generated name of the column.

What I'd like is something like:

def my_function = function(df) { 
    df %>% mutate(
          "ans_low" := some_func(prefix="Low"),
          "ans_high" := some_func(prefix="High")
}

I tried creating my own case_when generator to replace the standard case_when as shown below, but I'm getting an error. I'm guessing that's because .data doesn't really work outside of the tidyverse functions?

some_func = function(prefix) {
  case_when(
    !is.na(.data[[ sprintf("%s_TOT", prefix) ]]) ~ .data[[ sprintf("%s_TOT", prefix) ]],
    !is.na(.data[[ sprintf("%s_A", prefix) ]]) ~ .data[[ sprintf("%s_A", prefix) ]],
    !is.na(.data[[ sprintf("%s_B", prefix) ]]) ~ .data[[ sprintf("%s_B", prefix) ]]
  )
}

Something else I'm curious about is making an even more general case_when generator. In the examples thus far, it's only the names (prefix) of the columns that are changing. What if I wanted to

  1. change the number and names of suffixes (e.g., high_W, high_X, high_Y, high_Z, low_W, low_X, low_Y, low_Z, .......) and so make a character vector of suffixes an argument of some_func
  2. change the form of the formula. Right now, it's of the form !is.na(.data[[ . ]]) ~ .data[[ . ]] for all the conditions, but what if I wanted to make this an argument of some_func? For example, !is.na(.data[[ . ]]) ~ sprintf("%s is missing", .)

I'd be happy with just getting it to work with different prefixes but it'd be very cool to understand how I could achieve something even more general with arbitrary (but common) suffixes and arbitrary formulae such that I can do some_func(prefix, suffixes, formula).

anonymous1a
  • 785
  • 1
  • 7
  • 12
  • Please show a small reproducible example – akrun Jul 22 '21 at 18:12
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. If you are just trying to get the first non-NA value, a function like `coalesce()` is probably more appropriate. – MrFlick Jul 22 '21 at 18:18
  • See now. I added a simple data set as an example and rewrote the question to be clearer and shorter. It's still a little long because, really, I'm asking 3 questions about the increasing levels of generality that I'd like (and to see if that's possible, to begin with). – anonymous1a Jul 22 '21 at 19:31
  • A `coalesce()` could be a potential answer, but I'm more interested in dynamically generating conditions (`is.na` is just the particular example here and `coalesce` also requires a specific column order). I'm really trying to understand how to **program** with dplyr better and achieve higher levels of abstraction/generality. – anonymous1a Jul 22 '21 at 19:39
  • I also just tried `coalese()` with a prior reordering of the columns but it gives the same major issue: I have to write a whole bunch of `coalesce` statements now. I want to exploit the common prefix of the groups of columns so I don't have to write 10 different `case_when` or `coalese` statements. – anonymous1a Jul 22 '21 at 20:10
  • Why not split up the columns by prefix, using the results of `gsub("_(TOT|[A-Z]+)$", "", ...)` on `colnames()` to determine how they're split up? This accounts for indefinitely many column suffixes: `*_TOT`, `*_A`, `*_B`, `*_C`, ..., `*_Z`, `*_AA`, `*_AB`, ..., and so forth. Then for each of those splits (`"low_"` and `"high_"`), sort their `colnames()` by suffix, given by `str_extract("_(TOT|[A-Z]+)$")`; you'll obviously have to reorder `"_TOT"` as coming first. Then `mutate(paste0("ans_", prefix) = coalesce(everything()))`, and `cbind()` or `bind_cols()` all results back together. – Greg Jul 22 '21 at 20:44

6 Answers6

8

Here is a custom case_when function that you can call with purrr::reduce and a vector of strings parts of your variable names (in the example c("low", "high"):

library(dplyr)
library(purrr)

my_case_when <- function(df, x) {
  
  mutate(df,
         "ans_{x}" := case_when(
           !is.na(!! sym(paste0(x, "_TOT"))) ~ !! sym(paste0(x, "_TOT")),
           !is.na(!! sym(paste0(x, "_A"))) ~ !! sym(paste0(x, "_A")),
           !is.na(!! sym(paste0(x, "_B"))) ~ !! sym(paste0(x, "_B"))
           )
  )
}

test_df %>% 
  reduce(c("low", "high"), my_case_when, .init = .)

#> # A tibble: 3 x 8
#>   low_A low_TOT low_B high_A high_TOT high_B ans_low ans_high
#>   <dbl>   <dbl> <dbl>  <dbl>    <dbl>  <dbl>   <dbl>    <dbl>
#> 1     5      NA    20     NA       NA     60       5       60
#> 2    15      10    25     NA       40     20      10       40
#> 3    NA      NA    30     10       NA     NA      30       10

Created on 2021-07-22 by the reprex package (v0.3.0)

I also have a package on Github {dplyover} which is made for this kind of cases. For your example with more than two variables I would use dplyover::over together with a special syntax to evaluate strings as variable names. We can further use dplyover::cut_names("_TOT") to extract the string parts of the variable names that come before or after "_TOT" (in the example this is "low" and "high").

We can either use case_when:

library(dplyr)
library(dplyover) # https://github.com/TimTeaFan/dplyover

test_df %>% 
  mutate(over(cut_names("_TOT"),
              list(ans = ~ case_when(
                  !is.na(.("{.x}_TOT")) ~ .("{.x}_TOT"),
                  !is.na(.("{.x}_A")) ~ .("{.x}_A"),
                  !is.na(.("{.x}_B")) ~ .("{.x}_B")
                  )),
              .names = "{fn}_{x}")
         )

#> # A tibble: 3 x 8
#>   low_A low_TOT low_B high_A high_TOT high_B ans_low ans_high
#>   <dbl>   <dbl> <dbl>  <dbl>    <dbl>  <dbl>   <dbl>    <dbl>
#> 1     5      NA    20     NA       NA     60       5       60
#> 2    15      10    25     NA       40     20      10       40
#> 3    NA      NA    30     10       NA     NA      30       10

Or somewhat easier coalesce:

test_df %>% 
  mutate(over(cut_names("_TOT"),
              list(ans = ~ coalesce(.("{.x}_TOT"),
                                    .("{.x}_A"),
                                    .("{.x}_B"))),
              .names = "{fn}_{x}")
  )

#> # A tibble: 3 x 8
#>   low_A low_TOT low_B high_A high_TOT high_B ans_low ans_high
#>   <dbl>   <dbl> <dbl>  <dbl>    <dbl>  <dbl>   <dbl>    <dbl>
#> 1     5      NA    20     NA       NA     60       5       60
#> 2    15      10    25     NA       40     20      10       40
#> 3    NA      NA    30     10       NA     NA      30       10

Created on 2021-07-22 by the reprex package (v0.3.0)

TimTeaFan
  • 17,549
  • 4
  • 18
  • 39
  • 1
    Color me intrigued by `dplyover`! How might one do this for arbitrarily many suffixes? Consider: `_TOT`, `_A`, `_B`, ..., `_Z`, `_AA`, `_AB`, ..., and so forth; defined by the regex `_(TOT|[A-Z]+)$`. – Greg Jul 22 '21 at 20:50
  • 1
    @Greg: We can use regex in the selection functions `cut_names` and `extract_names`. However, within the `case_when` function above, we would need to hard code all suffixes, at least if we use `over`. There are also `over2` and `over2x` which will also allow a `.y` argument , but at the end it depends on how the `case_when` function should look like. – TimTeaFan Jul 22 '21 at 21:07
  • 1
    @Greg, I just found a solution (with some edits of my own), that allows for arbitrary suffixes. Look up my solution below. – anonymous1a Jul 22 '21 at 22:26
6

Updated Solution I think this solution solely based on base R may help you.

fn <- function(data) {
  
  do.call(cbind, lapply(unique(gsub("([[:alpha:]]+)_.*", "\\1", names(test_df))), function(x) {
    tmp <- test_df[paste0(x, c("_TOT", "_A", "_B"))]
    tmp[[paste(x, "ans", sep = "_")]] <- Reduce(function(a, b) {
      i <- which(is.na(a))
      a[i] <- b[i]
      a
    }, tmp)
    tmp
  }))
}

fn(test_df)

fn(test_df)

   high_TOT high_A high_B high_ans low_TOT low_A low_B low_ans
1       NA     NA     60       60      NA     5    20       5
2       40     NA     20       40      10    15    25      10
3       NA     10     NA       10      NA    NA    30      30
Anoushiravan R
  • 21,622
  • 3
  • 18
  • 41
6

At the risk of not answering the question, I think the easiest way to approach this is to just reshape and use coalesce(). Your data structure requires two pivots either way (I think) but this requires no careful thinking about what prefixes are present.

library(tidyverse)

test_df <- tibble(
  low_A = c(5, 15, NA),
  low_TOT = c(NA, 10, NA),
  low_B = c(20, 25, 30),
  high_A = c(NA, NA, 10),
  high_TOT = c(NA, 40, NA),
  high_B = c(60, 20, NA)
)

test_df %>%
  rowid_to_column() %>%
  pivot_longer(cols = -rowid, names_to = c("prefix", "suffix"), names_sep = "_") %>%
  pivot_wider(names_from = suffix, values_from = value) %>%
  mutate(ans = coalesce(TOT, A, B)) %>%
  pivot_longer(cols = c(-rowid, -prefix), names_to = "suffix") %>%
  pivot_wider(names_from = c(prefix, suffix), names_sep = "_", values_from = value)
#> # A tibble: 3 x 9
#>   rowid low_A low_TOT low_B low_ans high_A high_TOT high_B high_ans
#>   <int> <dbl>   <dbl> <dbl>   <dbl>  <dbl>    <dbl>  <dbl>    <dbl>
#> 1     1     5      NA    20       5     NA       NA     60       60
#> 2     2    15      10    25      10     NA       40     20       40
#> 3     3    NA      NA    30      30     10       NA     NA       10

Note also that case_when has no tidy evaluation, and so just not using mutate simplifies your some_func a lot. You already got an answer using !!sym inside mutate, so here is a version that illustrates a simpler way. I prefer not to use tidyeval unless necessary because I want to use a mutate chain, and here it's not really needed.

some_func <- function(df, prefix) {
  ans <- str_c(prefix, "_ans")
  TOT <- df[[str_c(prefix, "_TOT")]]
  A <- df[[str_c(prefix, "_A")]]
  B <- df[[str_c(prefix, "_B")]]
  
  df[[ans]] <- case_when(
    !is.na(TOT) ~ TOT,
    !is.na(A) ~ A,
    !is.na(B) ~ B
  )
  df
}

reduce(c("low", "high"), some_func, .init = test_df)
#> # A tibble: 3 x 8
#>   low_A low_TOT low_B high_A high_TOT high_B low_ans high_ans
#>   <dbl>   <dbl> <dbl>  <dbl>    <dbl>  <dbl>   <dbl>    <dbl>
#> 1     5      NA    20     NA       NA     60       5       60
#> 2    15      10    25     NA       40     20      10       40
#> 3    NA      NA    30     10       NA     NA      30       10
Calum You
  • 14,687
  • 4
  • 23
  • 42
  • [My instinct](https://stackoverflow.com/posts/comments/121044897) too was to just _"to just reshape and use `coalesce()`"_. I am curious, though: how might we generalize to arbitrarily many alphabetical suffixes `_A`, `_B`, ..., `_Z`, `_AA`, `_AB`, and so forth? And for prefixes that might themselves contain `_`, as with `another_prefix_A`? Perhaps by pivoting the column names into a `name` column, and then by splitting each `name` into (1) the substring matching the regex `_(TOT|[A-Z]+)$`, and (2) the substring of everything that came before. – Greg Jul 22 '21 at 21:15
  • 1
    To handle arbitrary suffixes, I'd probably want to order the list of suffixes and splice into `coalesce`. For complex prefixes, `pivot_longer` supports `names_pattern`, so you can use use a regex to select the groups you want (e.g. `(^.*)_([^_]+$)` would (I think) make the suffix the last `_` before the end of the string, and the prefix everything before that underscore. – Calum You Jul 22 '21 at 22:18
  • I really liked your base R solution (way more than the multiple pivots). Unfortunately, yours doesn't really solve the problem of generating `case_when` conditions dynamically so I had to choose TimeTeaFan's as the accepted answer. But if I had to do this again, I'd definitely use your base R solution which is much easier to grasp. – anonymous1a Jul 22 '21 at 22:30
3

Thanks for all your answers folks! Calum You's answer specifically made me realise that sticking to the Tidyverse all the time isn't necessarily the best and sometimes base R has a better, simpler, and more elegant solution.

Thans to a ton of searching and this excellent post by noahm on the RStduio community, I was also able to come up with a solution of my own that does what I was looking for:

library(tidyverse)
library(rlang)
library(glue)

make_expr = function(prefix, suffix) {
  rlang::parse_expr(glue::glue('!is.na(.data[[\"{prefix}_{suffix}\"]]) ~ .data[[\"{prefix}_{suffix}\"]]'))
}

make_conds = function(prefixes, suffixes){
  map2(prefixes, suffixes, make_expr)
}

ans_df = test_df %>%  
    mutate(
        "ans_low" := case_when(
            !!! make_conds( prefixes=c("low"), suffixes=c("TOT", "A", "B") ) 
        ),
        "ans_high" := case_when(
            !!! make_conds( prefixes=c("high"), suffixes=c("TOT", "A", "B") ) 
        )
    )

# The ans is the same as the expected solution
> all_equal(ans_df, expected_df)
[1] TRUE

I've also checked that this works inside of a function (which was another important consideration for me).

One benefits of this solution is that the suffixes are not hard-coded and achieve at least the first level of generality I was looking for.

I imagine some string manipulation with replacements could possibly also allow for generality with the structure of the formulae. Ultimately, general formulae would require a string templating solution of some sort because with this structure, you can just keep that into glue.

anonymous1a
  • 785
  • 1
  • 7
  • 12
2

This does not generate any case_when, but you can create the two new columns as follows. Of course this could also be a function with test_df, ans_order, and and_groups as arguments.

ans_order <- c('TOT', 'A', 'B')
ans_groups <- c('low', 'high')

test_df[paste0('ans_', ans_groups)] <- 
  apply(outer(ans_groups, ans_order, paste, sep = '_'), 1, 
        function(x) do.call(dplyr::coalesce, test_df[x]))

test_df
#>   low_A low_TOT low_B high_A high_TOT high_B ans_low ans_high
#> 1     5      NA    20     NA       NA     60       5       60
#> 2    15      10    25     NA       40     20      10       40
#> 3    NA      NA    30     10       NA     NA      30       10

If you'd rather not use any packages, another option is

test_df[paste0('ans_', ans_groups)] <- 
  apply(outer(ans_groups, ans_order, paste, sep = '_'), 1, 
        function(x) Reduce(function(x, y) ifelse(is.na(x), y, x), test_df[x]))
IceCreamToucan
  • 28,083
  • 2
  • 22
  • 38
1

Though the answer has been accepted, I feel this can be done (even for any number of column sets) in dplyr only without the need of writing a custom function earlier.

test_df %>%
  mutate(across(ends_with('_TOT'), ~ coalesce(., 
                                              get(gsub('_TOT', '_A', cur_column())), 
                                              get(gsub('_TOT', '_B', cur_column()))
                                              ),
                .names = "ans_{gsub('_TOT', '', .col)}"))

# A tibble: 3 x 8
  low_A low_TOT low_B high_A high_TOT high_B ans_low ans_high
  <dbl>   <dbl> <dbl>  <dbl>    <dbl>  <dbl>   <dbl>    <dbl>
1     5      NA    20     NA       NA     60       5       60
2    15      10    25     NA       40     20      10       40
3    NA      NA    30     10       NA     NA      30       10

A complete base R approach

Reduce(function(.x, .y) {
  xx <- .x[paste0(.y, c('_TOT', '_A', '_B'))]
  .x[[paste0('ans_',.y)]] <- apply(xx, 1, \(.z) head(na.omit(.z), 1))
  .x
}, unique(gsub('([_]*)_.*', '\\1', names(test_df))),
init = test_df)

# A tibble: 3 x 8
  low_A low_TOT low_B high_A high_TOT high_B ans_low ans_high
  <dbl>   <dbl> <dbl>  <dbl>    <dbl>  <dbl>   <dbl>    <dbl>
1     5      NA    20     NA       NA     60       5       60
2    15      10    25     NA       40     20      10       40
3    NA      NA    30     10       NA     NA      30       10
AnilGoyal
  • 25,297
  • 4
  • 27
  • 45