17

Goal

My goal is to define some functions for use within dplyr verbs, that use pre-defined variables. This is because I have some of these functions that take a bunch of arguments, of which many always are the same variable names.

My understanding: This is difficult (and perhaps impossible) because dplyr will lazily evaluate user-specified variables later on, but any default arguments are not in the function call and therefore invisible to dplyr.

Toy example

Consider the following example, where I use dplyr to calculate whether a variable has changed or not (rather meaningless in this case):

library(dplyr)
mtcars  %>%
  mutate(cyl_change = cyl != lag(cyl))

Now, lag also supports alternate ordering like so:

mtcars  %>%
  mutate(cyl_change = cyl != lag(cyl, order_by = gear))

But what if I'd like to create my own version of lag that always orders by gear?

Failed attempts

The naive approach is this:

lag2 <- function(x, n = 1L, order_by = gear) lag(x, n = n, order_by = order_by)

mtcars %>%
  mutate(cyl_change = cyl != lag2(cyl))

But this obviously raises the error:

no object named ‘gear’ was found

More realistic options would be these, but they also don't work:

lag2 <- function(x, n = 1L) lag(x, n = n, order_by = ~gear)
lag2 <- function(x, n = 1L) lag(x, n = n, order_by = get(gear))
lag2 <- function(x, n = 1L) lag(x, n = n, order_by = getAnywhere(gear))
lag2 <- function(x, n = 1L) lag(x, n = n, order_by = lazyeval::lazy(gear))

Question

Is there a way to get lag2 to correctly find gear within the data.frame that dplyr is operating on?

  • One should be able to call lag2 without having to provide gear.
  • One should be able to use lag2 on datasets that are not called mtcars (but do have gear as one it's variables).
  • Preferably gear would be a default argument to the function, so it can still be changed if required, but this is not crucial.
Axeman
  • 32,068
  • 8
  • 81
  • 94
  • `gear` is another vector right? You're not passing it to the local environment of `lag2`. Try `lag2 <- function(x, gear) {...}` (note, no need for param `n` as written). – alexwhitworth Mar 29 '16 at 16:41
  • `gear` is a variable in `mtcars`. Yeah I goofed the `n` argument. – Axeman Mar 29 '16 at 16:44
  • Your function `lag2` requires a parameter vector `gear`. But you're not passing `gear` to the function... rewrite your function so that gear is passed to it. – alexwhitworth Mar 29 '16 at 16:53
  • Sorry I wasn't clear. I want to have a version of `lag`, where `order_by` is set to `gear` by default, without me needing to specify that. `lag2 <- function(x, gear) lag(x, order_by = gear)` needs specification of the `gear` argument. `lag2 <- function(x, gear = gear) lag(x, order_by = gear)` is illegal (can't do x = x in the arguments). `lag2 <- function(x, y = gear) lag(x, order_by = y)` will just give the same error as my first attempt in the question. – Axeman Mar 29 '16 at 17:01
  • The vector "gear" needs to be defined globally or passed to the function. Use either lag2 <- function(x) lag(x, order_by = mtcars$gear) or redefine your function as per Alex: lag2 <- function(x, gear) lag(x, order_by = gear), mtcars %>% mutate(cyl_change = cyl != lag2(cyl, gear)) – Dave2e Mar 29 '16 at 19:52
  • Sure. But I was hoping I could use NSE and lazy evaluation to only evaluate `gear` within the context of `summarize` where other bare variable names are correctly evaluated. Likely by fetching it from the correct environment.. – Axeman Mar 29 '16 at 21:19
  • 2
    @Axeman if you want, I can suggest a couple of approaches that will almost get you to where you want in `data.table`, but neither of them would work with `dplyr` – eddi Mar 31 '16 at 21:18
  • 1
    @eddi Always happy to learn, but I'm already very committed to (`multi`)`dplyr` for this specific project. – Axeman Mar 31 '16 at 21:31
  • 2
    I never understood how people become committed (trapped?) to using just a particular library in R, which is specifically designed to use a lot of libraries. – eddi Apr 01 '16 at 02:33
  • @eddi, you're right, of course. I'm actually writing a package that includes methods for most `dplyr` verbs as way of dealing with a specific class of object. Since those verbs are the interface, I'd like to keep that consistent. I _can_ run `data.table` code in the `mutate_` method, but capturing of the arguments is done by the S3 generic and I suppose this will pose some restrictions on possible solutions. I'm still interested in seeing the `data.table` solutions. – Axeman Apr 01 '16 at 06:07
  • I also think DT is the way to go here, @eddi ... But if the OP is fixated on dplyr, that's the way it is. – alexwhitworth Apr 03 '16 at 20:19

5 Answers5

10

Here are two approaches in data.table, however I don't believe that either of them will work in dplyr at the present.

In data.table, whatever is inside the j-expression (aka the 2nd argument of [.data.table) gets parsed by the data.table package first, and not by regular R parser. In a way you can think of it as a separate language parser living inside the regular language parser that is R. What this parser does, is it looks for what variables you have used that are actually columns of the data.table you're operating on, and whatever it finds it puts it in the environment of the j-expression.

What this means, is that you have to let this parser know somehow that gear will be used, or it simply will not be part of the environment. Following are two ideas for accomplishing that.

The "simple" way to do it, is to actually use the column name in the j-expression where you call lag2 (in addition to some monkeying within lag2):

dt = as.data.table(mtcars)

lag2 = function(x) lag(x, order_by = get('gear', sys.frame(4)))

dt[, newvar := {gear; lag2(cyl)}]
# or
dt[, newvar := {.SD; lag2(cyl)}]

This solution has 2 undesirable properties imo - first, I'm not sure how fragile that sys.frame(4) is - you put this thing in a function or a package and I don't know what will happen. You can probably work around it and figure out the right frame, but it's kind of a pain. Second - you either have to mention the particular variable you're interested in, anywhere in the expression, or dump all of them in the environment by using .SD, again anywhere.

A second option that I like more, is to take advantage of the fact that the data.table parser evaluates eval expressions in place before the variable lookup, so if you use a variable inside some expression that you eval, that would work:

lag3 = quote(function(x) lag(x, order_by = gear))

dt[, newvar := eval(lag3)(cyl)]

This doesn't suffer from the issues of the other solution, with the obvious disadvantage of having to type an extra eval.

eddi
  • 49,088
  • 6
  • 104
  • 155
  • This got me thinking that perhaps functions should be "evaluated in place" as well, similar to `eval` (basically the whole function expression copy-pasted into your expression), but this will probably add an insane amount of overhead (basically doing everything that R parser does, using R functions) and is not worth it. – eddi Apr 01 '16 at 16:15
4

This solution is coming close:

Consider a slightly easier toy example:

mtcars %>%
  mutate(carb2 = lag(carb, order_by = gear))

We still use lag and it's order_by argument, but don't do any further computation with it. Instead of sticking to the SE mutate, we switch to NSE mutate_ and make lag2 build a function call as a character vector.

lag2 <- function(x, n = 1, order_by = gear) {
  x <- deparse(substitute(x))
  order_by <- deparse(substitute(order_by))
  paste0('dplyr::lag(x = ', x, ', n = ', n, ', order_by = ', order_by, ')')
}

mtcars %>%
  mutate_(carb2 = lag2(carb))

This gives us an identical result to the above.

The orginial toy example can be achieved with:

mtcars %>%
  mutate_(cyl_change = paste('cyl !=', lag2(cyl)))

Downsides:

  1. We have to use the SE mutate_.
  2. For extended usage as in the original example we need to also use paste.
  3. This is not particularly safe, i.e. it is not immediately clear where gear should come from. Assigning values to gear or carb in the global environment seems to be ok, but my guess is that unexpected bugs could occur in some cases. Using a formula instead of a character vector would be safer, but this requires the correct environment to be assigned for it to work, and that is still a big question mark for me.
Axeman
  • 32,068
  • 8
  • 81
  • 94
3

This isn't elegant, as it requires an extra argument. But, by passing the entire data frame we get nearly the required behavior

lag2 <- function(x, df, n = 1L, order_by = df[['gear']], ...) {
  lag(x, n = n, order_by = order_by, ...)
}

hack <- mtcars  %>%  mutate(cyl_change = cyl != lag2(cyl, .))
ans <- mtcars  %>%  mutate(cyl_change = cyl != lag(cyl, order_by = gear))
all.equal(hack, ans)
# [1] TRUE
  1. One should be able to call lag2 without having to provide gear.

Yes, but you need to pass ..

  1. One should be able to use lag2 on datasets that are not called mtcars (but do have gear as one it's variables).

This works.

  1. Preferably gear would be a default argument to the function, so it can still be changed if required, but this is not crucial.

This also works:

hack_nondefault <- mtcars %>%  mutate(cyl_change = cyl != lag2(cyl, order_by = cyl))
ans_nondefault <- mtcars %>%  mutate(cyl_change = cyl != lag(cyl, order_by = cyl))
all.equal(hack_nondefault, ans_nondefault)
# [1] TRUE

Note that if you manually give order_by, specifying df with the . is not longer necessary and usage becomes identical to the original lag (which is very nice).

Addendum

It seems hard to avoid using SE mutate_ as in the answer posed by the OP, to do some simple hackery like in my answer here, or to do something more advanced involving reverse-engineering lazyeval::lazy_dots.

Evidence:

1) dplyr::lag itself doesn't use any NSE wizardry

2) mutate simply calls mutate_(.data, .dots = lazyeval::lazy_dots(...))

Axeman
  • 32,068
  • 8
  • 81
  • 94
jaimedash
  • 2,683
  • 17
  • 30
  • This is a good solution as well, let me think about it some more. The passing of a character vector is no problem as this can be fixed. – Axeman Apr 05 '16 at 04:56
  • 1
    thanks! I'm trying to learn about NSE myself but sort despaired here after noticing the facts my addendum. this is a real challenge. please feel free to edit (or comment, and I'll edit) if it's easy to change this so just `gear` can be passed. I wasn't immediately sure how to do it – jaimedash Apr 05 '16 at 05:37
  • 1
    Have a look at my answer for a solution to that :) – Axeman Apr 05 '16 at 05:47
  • 1
    ha! I should have read more carefully. so, `lag3 <- function(x, df, n = 1L, order_by=gear, ...) { order_by <- deparse(substitute(order_by)) lag(x, n = n, order_by = df[[order_by]], ...) }`? But, though that works for the first case it fails in the third: `mutate(cyl_change = cyl != lag3(cyl, ., order_by = cyl))` evals `cyl` to `c(6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8, 8, 8, 8, 4, 4, 4, 8, 6, 8, 4))`. :\ – jaimedash Apr 05 '16 at 06:03
  • I see... I don't quite understand but it seems the `.` is messing that up? – Axeman Apr 05 '16 at 06:16
  • 1
    Ok, I've now fixed that small issue now (and made an edit). I think this is the best answer I'm going to get and pretty near perfect! – Axeman Apr 07 '16 at 15:35
1

Here is my eventual answer that I actually ended up using. It fundamentally relies on a function that explicitly injects any default function values into the expressions of a lazy dots object.

The complete function (with comments) is at the end of this answer.

Limitations:

  • You need at least some additional tricks to make this work nicely (see below).
  • It ignores primitive functions, but I don't think these have default function arguments.
  • For S3 generics, one should use the actual method instead. Such as seq.default instead of seq. If the goal is injection of default values in your own functions, then this generally won't be much of a problem.

For example, one can use this function like this:

dots <- lazyeval::all_dots(a = ~x, b = ~lm(y ~ x, data = d))
add_defaults_to_dots(dots)
$a
<lazy>
  expr: x
  env:  <environment: R_GlobalEnv>

$b
<lazy>
  expr: lm(formula = y ~ x, data = d, subset = , weights = , na.action = ,  ...
  env:  <environment: R_GlobalEnv>

We can solve the toy problem from the question in several ways. Remember the new function and the ideal use case:

lag2 <- function(x, n = 1L, order_by = gear) lag(x, n = n, order_by = order_by)

mtcars %>%
  mutate(cyl_change = cyl != lag2(cyl))
  1. Use mutate_ with dots directly:

    dots <- lazyeval::all_dots(cyl_change = ~cyl != lag2(cyl), all_named = TRUE)
    dots <- add_defaults_to_dots(dots)
    mtcars %>% mutate_(.dots = dots)
    
  2. Redefine mutate to include the addition of defaults.

    mutate2 <- function(.data, ...) {
      dots <- lazyeval::lazy_dots(...)
      dots <- add_defaults_to_dots(dots)
      dplyr::mutate_(.data, .dots = dots)
    }
    
    mtcars %>% mutate2(cyl_change = cyl != lag2(cyl))
    
  3. Use S3 dispatch to do this as the default for any custom class:

    mtcars2 <- mtcars
    class(mtcars2) <- c('test', 'data.frame')
    
    mutate_.test <- function(.data, ..., .dots) {
      dots <- lazyeval::all_dots(.dots, ..., all_named = TRUE)
      dots <- add_defaults_to_dots(dots)
      dplyr::mutate_(tibble::as_tibble(.data), .dots = dots)
    }
    mtcars2 %>% mutate(cyl_change = cyl != lag2(cyl))
    

Depending on the use case, options 2 and 3 are the best ways to accomplish this I think. Option 3 actually has the complete suggested use case, but does rely on an additional S3 class.

Function:

add_defaults_to_dots <- function(dots) {
  # A recursive function that continues to add defaults to lower and lower levels.
  add_defaults_to_expr <- function(expr) {
    # First, if a call is a symbol or vector, there is nothing left to do but
    # return the value (since it is not a function call).
    if (is.symbol(expr) | is.vector(expr) | class(expr) == "formula") {
      return(expr)
    }
    # If it is a function however, we need to extract it.
    fun <- expr[[1]]
    # If it is a primitive function (like `+`) there are no defaults, and we
    # should not manipulate that call, but we do need to use recursion for cases
    # like a + f(b).
    if (is.primitive(match.fun(fun))) {
      new_expr <- expr
    } else {
      # If we have an actual non-primitive function call, we formally match the
      # call, so abbreviated arguments and order reliance work.
      matched_expr <- match.call(match.fun(fun), expr, expand.dots = TRUE)
      expr_list <- as.list(matched_expr)
      # Then we find the default arguments:
      arguments <- formals(eval(fun))
      # And overwrite the defaults for which other values were supplied:
      given <- expr_list[-1]
      arguments[names(given)] <- given
      # And finally build the new call:
      new_expr <- as.call(c(fun, arguments))
    }
    # Then, for all function arguments we run the function recursively.
    new_arguments <- as.list(new_expr)[-1]
    null <- sapply(new_arguments, is.null)
    new_arguments[!null] <- lapply(new_arguments[!null], add_defaults_to_expr)
    new_expr <- as.call(c(fun, new_arguments))
    return(new_expr)
  }
  # For lazy dots supplied, separate the expression and environments.
  exprs <- lapply(dots, `[[`, 'expr')
  envrs <- lapply(dots, `[[`, 'env')
  # Add the defaults to the expressions.
  new_exprs <- lapply(exprs, add_defaults_to_expr)
  # Add back the correct environments.
  new_calls <- Map(function(x, y) {
    lazyeval::as.lazy(x, y)
  }, new_exprs, envrs)
  return(new_calls)
}
Axeman
  • 32,068
  • 8
  • 81
  • 94
1

You can also solve your problem in the following way:

library(dplyr)

lag2 <- function(df, x, n = 1L, order_by = gear) {
  order_var <- enquo(order_by)
  x <- enquo(x)
  var_name <- paste0(quo_name(x), "_change")

  df %>% 
    mutate(!!var_name := lag(!!x, n = n, order_by = !!order_var))
}

mtcars %>%
  lag2(cyl)

# A tibble: 32 x 12
#      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb cyl_change
#    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>      <dbl>
#  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4          8
#  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4          6
#  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1          6
#  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1         NA
#  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2          6
#  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1          8
#  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4          6
#  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2          4
#  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2          4
# 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4          4
# ... with 22 more rows

I'm aware, that again the dataframe has to be passed on in the function, but in that way the environment where gear is expected is clearer. Also the piping nature is preserved nicely as well as automatic defining the name of the new variable.

Comment: I'm pretty sure this solution wasn't available when you first posted this question, but nevertheless it might be nice to keep this here for future reference.

kath
  • 7,624
  • 17
  • 32