7

I have a function which returns a tibble. It runs OK, but I want to vectorize it.

library(tidyverse)

tibTest <- tibble(argX = 1:4, argY = 7:4)

square_it <- function(xx, yy) {
  if(xx >= 4){
    tibble(x = NA, y = NA)
  } else if(xx == 3){
    tibble(x = as.integer(), y = as.integer())
  } else if (xx == 2){
    tibble(x = xx^2 - 1, y = yy^2 -1)
  } else {
    tibble(x = xx^2, y = yy^2)
  }
}

It runs OK in a mutate when I call it with map2, giving me the result I wanted:

tibTest %>%
  mutate(sq = map2(argX, argY, square_it)) %>%
  unnest()
## A tibble: 3 x 4
#     argX  argY     x     y
#    <int> <int> <dbl> <dbl>
# 1     1     7     1    49
# 2     2     6     3    35
# 3     4     4    NA    NA

My first attempt to vectorize it failed, and I can see why - I can't return a vector of tibbles.

square_it2 <- function(xx, yy){
  case_when(
    x >= 4 ~ tibble(x = NA, y = NA),
    x == 3 ~ tibble(x = as.integer(), y = as.integer()),
    x == 2 ~ tibble(x = xx^2 - 1, y = yy^2 -1),
    TRUE   ~ tibble(x = xx^2,     y = yy^2)
  )
}
# square_it2(4, 2)  # FAILS

My next attempt runs OK on a simple input. I can return a list of tibbles, and that's what I want for the unnest

square_it3 <- function(xx, yy){
  case_when(
    xx >= 4 ~ list(tibble(x = NA, y = NA)),
    xx == 3 ~ list(tibble(x = as.integer(), y = as.integer())),
    xx == 2 ~ list(tibble(x = xx^2 - 1, y = yy^2 -1)),
    TRUE   ~ list(tibble(x = xx^2,     y = yy^2))
  )
}
square_it3(4, 2)
# [[1]]
# # A tibble: 1 x 2
# x     y    
# <lgl> <lgl>
#   1 NA    NA   

But when I call it in a mutate, it doesn't give me the result I had with square_it. I can sort of see what's wrong. In the xx == 2 clause, xx acts as an atomic value of 2. But in building the tibble, xx is a length-4 vector.

tibTest %>%
  mutate(sq =  square_it3(argX, argY)) %>%
  unnest()
# # A tibble: 9 x 4
#    argX  argY     x     y
#    <int> <int> <dbl> <dbl>
# 1     1     7     1    49
# 2     1     7     4    36
# 3     1     7     9    25
# 4     1     7    16    16
# 5     2     6     0    48
# 6     2     6     3    35
# 7     2     6     8    24
# 8     2     6    15    15
# 9     4     4    NA    NA

How do I get the same result as I did with square_it, but from a vectorized function using case_when ?

David T
  • 1,993
  • 10
  • 18

2 Answers2

3

We define row_case_when which has a similar formula interface as case_when except it has a first argument of .data, acts by row and expects that the value of each leg to be a data frame. It returns a data.frame/tibble. Wrapping in a list, rowwise and unnest are not needed.

case_when2 <- function (.data, ...) {
    fs <- dplyr:::compact_null(rlang:::list2(...))
    n <- length(fs)
    if (n == 0) {
        abort("No cases provided")
    }
    query <- vector("list", n)
    value <- vector("list", n)
    default_env <- rlang:::caller_env()
    quos_pairs <- purrr::map2(fs, seq_along(fs), dplyr:::validate_formula,
        rlang:::default_env, rlang:::current_env())
    for (i in seq_len(n)) {
        pair <- quos_pairs[[i]]
        query[[i]] <- rlang::eval_tidy(pair$lhs, data = .data, env = default_env)
        value[[i]] <- rlang::eval_tidy(pair$rhs, data = .data, env = default_env)
        if (!is.logical(query[[i]])) {
            abort_case_when_logical(pair$lhs, i, query[[i]])
        }
        if (query[[i]]) return(value[[i]])
    }
}

row_case_when <- function(.data, ...) {
  .data %>% 
    group_by(.group = 1:n(), !!!.data) %>%
    do(case_when2(., ...)) %>%
    mutate %>%
    ungroup %>%
    select(-.group)
}

Test run

It is used like this:

library(dplyr)

tibTest <- tibble(argX = 1:4, argY = 7:4) # test data from question

tibTest %>%
  row_case_when(argX >= 4 ~ tibble(x = NA, y = NA),
    argX == 3 ~ tibble(x = as.integer(), y = as.integer()),
    argX == 2 ~ tibble(x = argX^2 - 1, y = argY^2 -1),
    TRUE   ~ tibble(x = argX^2,     y = argY^2)
  )

giving:

# A tibble: 3 x 4
   argX  argY     x     y
  <int> <int> <dbl> <dbl>
1     1     7     1    49
2     2     6     3    35
3     4     4    NA    NA

mutate_cond and mutate_when

These are not quite the same as row_case_when since they don't run through conditions taking the first true one but by using mutually exclusive conditions they can be used for certain aspects of this problem. They do not handle changing the number of rows in the result but we can use dplyr::filter to remove rows for a particular condition.

mutate_cond defined in dplyr mutate/replace several columns on a subset of rows is like mutate except the second argument is a condition and the subsequent arguments are applied only to rows for which that condition is TRUE.

mutate_when defined in dplyr mutate/replace several columns on a subset of rows is similar to case_when except it applies to rows, the replacement values are provided in a list and alternate arguments are conditions and lists. Also all legs are always run applying the replacement values to the rows satisfying the conditions (as opposed to, for each row, performing the replacement on just the first true leg). To get a similar effect to row_case_when be sure that the conditions are mutually exclusive.

# mutate_cond example
tibTest %>%
  filter(argX != 3) %>%
  mutate(x = NA_integer_, y = NA_integer_) %>%
  mutate_cond(argX == 2, x = argX^2 - 1L, y = argY^2 - 1L) %>%
  mutate_cond(argX < 2, x = argX^2, y = argY^2)

# mutate_when example
tibTest %>%
  filter(argX != 3) %>%
  mutate_when(TRUE, list(x = NA_integer_, y = NA_integer_),
              argX == 2, list(x = argX^2 - 1L, y = argY^2 - 1L), 
              argX < 2, list(x = argX^2, y = argY^2))
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • This produces a nice interface with a great intuitive syntax. If I understand correctly though, it still isn't vectorized because `row_case_when` has to call `rowwise` which under the hood means `case_when2` is being looped inside a map function (or equivalent)? Or have I misunderstood? – Allan Cameron May 17 '20 at 15:21
  • 1
    Surely vectorized means that it takes a data frame and acts row by row without you doing it yourself. as opposed to how it works internally. Ultimately there is always a loop somewhere even if it is deep in C code. Note that because of the way case_when acts in this case it is more complicated then just moving rowwise inside if we want to avoid wrapping each leg in a list. – G. Grothendieck May 17 '20 at 15:26
  • 1
    I think that's a perfectly good definition of "vectorised", but it seems that a lot of R users take exception to something being called vectorised if it is created by a loop within R, as opposed to a loop in the underlying C code, since it has a far greater overhead. I think this is what @RonakShah meant when he said "this isn't really vectorised". In a way it is a useful distinction to preserve because the performance difference is so marked that R users will try to avoid loops, apply functions and maps if a native C way to vectorise exists. Maybe I'm wrong about the community use of the term – Allan Cameron May 17 '20 at 17:31
  • You are likely not going to get any significant speedup from writing it in C because the time will be dominated by the R code in the legs which has to be iterated over. – G. Grothendieck May 17 '20 at 18:08
  • I avoid `loops`, `apply` functions and `maps` and use vectorised functions. Not just for efficiency reason, but because also because they save me steps (where I'm also likely to introduce errors.) And they remove _clutter_ from my code, without making it cryptic. IMHO, using `case-when` gives me much more legible code than any of the other conditionals (`if/else`, `ifelse`, `if_else`, `switch`...) But I've had trouble when I tried to get `case_when` to return tibbles. – David T May 17 '20 at 18:31
  • Which is why I welcome `row_case_when`. I've already copied it into my personal utility-package. I'll be using it _this week_ in my code development. – David T May 17 '20 at 18:33
  • Have added discussion of mutate_cond and mutate_when which were defined in the cited SO posts. – G. Grothendieck May 17 '20 at 22:28
2

You need to ensure you are creating a 1-row tibble with each call of the function, then vectorize that.

This works whether you have rowwise groups or not.

You can do this with switch wrapped in a map2:

Here's a reprex:

library(tidyverse)

tibTest <- tibble(argX = 1:4, argY = 7:4)

square_it <- function(xx, yy) {
  map2(xx, yy, function(x, y){
    switch(which(c(x >= 4, 
                   x == 3, 
                   x == 2, 
                   x < 4 & x != 3 & x != 2)),
           tibble(x = NA, y = NA),
           tibble(x = as.integer(), y = as.integer()),
           tibble(x = x^2 - 1, y = y^2 -1),
           tibble(x = x^2, y = y^2))})
}

tibTest %>% mutate(sq =  square_it(argX, argY)) %>% unnest(cols = sq)
#> # A tibble: 3 x 4
#>    argX  argY     x     y
#>   <int> <int> <dbl> <dbl>
#> 1     1     7     1    49
#> 2     2     6     3    35
#> 3     4     4    NA    NA

Created on 2020-05-16 by the reprex package (v0.3.0)

Allan Cameron
  • 147,086
  • 7
  • 49
  • 87
  • Thanks, but I was specifically trying to use `case_when`. – David T May 16 '20 at 14:22
  • 1
    I think that's the problem @DavidT. `case_when` is designed for applying to vectors, but you are wanting to use it specifically for length-1 vectors. Passing rowwise groups is one way of doing that, but from the perspective of software design, you should aim to have the function work on its own inside a `mutate` call. If you prefer `case_when` syntax, I will modify the answer. – Allan Cameron May 16 '20 at 14:28
  • Yes, please do. – David T May 16 '20 at 15:12
  • `case_when`, if I understand it right, takes its vector inputs and applies its conditional logic to each element of the vector. In my code, inside the `case_when`, the line `xx == 2 ~ list(tibble(x = xx^2 - 1, y = yy^2 -1))` treats the `xx` to the left of the `~` as a single vector-element, but the `xx` to the right of the value as the entire vector. – David T May 16 '20 at 15:17
  • 1
    @DavidT that's right - so if your input is `xx = c(2, 2, 2)` then for **each** element in `xx` you will create a tibble with three rows. You will therefore end up with 9 rows when you `unnest()`, which isn't what you wanted. It works `rowwise` because `xx` will only have a single member. But then, why use `case_when`? – Allan Cameron May 16 '20 at 15:32
  • 1
    I'm trying to vectorize all the functions I write, when possible. I _like_ `case_when`. Most of the time, when I use it, I can save myself a `map`ing, and I prefer its syntax to `switch`. So I was trying to see just how far I could push it - what the limits are. And I guess I just found out. :-) Thanks for your help. – David T May 16 '20 at 15:43
  • This isn't really vectorized right? I mean this is same as `square_it`. The only difference is `map2` which was outside the function is now inside. @DavidT – Ronak Shah May 17 '20 at 00:55
  • @ronak-shah Right. I mean, it's technically vectorized - it will take vectors as input. But it's the `map` that's breaking things apart into individual cases, not the `case_when`, which is what I was looking for. – David T May 17 '20 at 01:06
  • @RonakShah you're right, it's not truly vectorised. The point of this answer was to show why there were too many rows being produced in the third case, and that `case_when` probably wasn't the right approach to getting the desired output. There are better ways of getting the desired output here than unnesting tibbles of course. The difficulty I had with `case_when` was getting it to produce a one-row tibble for each value of `xx` and `yy` without first splitting the columns row-wise (either by `rowwise` or `map2`). I also wanted to point out that this should be done inside the function. – Allan Cameron May 17 '20 at 01:19
  • @RonakShah if you can demonstrate a way of getting a single row tibble for each element of two columns using a `case_when` without first making row-wise groups or using map, I'd be very grateful to learn, and it would be a better answer to the OPs question. – Allan Cameron May 17 '20 at 01:25