
Note: The described behaviour has been fixed in the dev version of dplyr. You can install dplyr using devtools::install_github("hadley/dplyr")

Please see this minimal example; I am using dplyr v0.3.0.2 and data.table v1.9.4:

library(dplyr)
library(data.table)
f <- function(x, y, bad) {
  z <- data.table(x, y, key = "x")
  z2 <- z %>% group_by(x) %>% summarise(sum.bad = sum(y == bad))
  z2
}

f(rnorm(100), rnorm(100) < 0, bad = FALSE) 

When I run the above, I get:

Error in `[.data.table`(dt, , list(sum.bad = sum(y == bad)), by = vars) : 
  object 'bad' not found

However, bad is clearly defined and in scope.

If I just run this outside of a function, it works:

x <- rnorm(100)
y <- rnorm(100) < 0
bad <- FALSE
z <- data.table(x, y, key = "x")

z2 <- z %>% group_by(x) %>% summarise(sum.bad = sum(y == bad))
z2

What is the issue here? Is this a bug in either data.table or dplyr?
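
For what it's worth, here is a quick sketch that skips dplyr and uses data.table directly inside a function (same inputs as f() above; the function name g is just for illustration):

g <- function(x, y, bad) {
  z <- data.table(x, y, key = "x")
  # data.table looks up names not found among the columns in the calling frame,
  # so 'bad' (an argument of g) is visible here
  z[, .(sum.bad = sum(y == bad)), by = x]
}

g(rnorm(100), rnorm(100) < 0, bad = FALSE)

This seems to run fine, so the problem only shows up when going through the dplyr verbs.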

xiaodai
  • It doesn't throw an error with `dplyr 0.4.0`, `data.table 1.9.4`. – Khashaa Jan 05 '15 at 05:37
  • I got the latest CRAN version which is 0.3.0.2. – xiaodai Jan 05 '15 at 05:40
  • If you wrap parens around the `summarise` call, the error changes to `.data` not found in `summarise_` – Rich Scriven Jan 05 '15 at 05:41
  • I think that due to non-standard evaluation, `bad` is outside the scope by the time you want to evaluate it. Try a `force` call in there – Rich Scriven Jan 05 '15 at 05:49
  • @RichardScriven can you give an example? I don't really know what you mean. – xiaodai Jan 05 '15 at 05:52
  • `f(rnorm(100), rnorm(100) < 0, bad = 0)` returns the same thing as `f(rnorm(100), rnorm(100) < 0, bad = FALSE)` – Khashaa Jan 05 '15 at 05:55
  • @Khashaa - you're right. It failed for me too; I had `bad` defined in global. This sounds like it's beyond my level of expertise :) but definitely a scoping issue. I suggest OP put a call to `print(ls.str())` all over the place. It looks like `bad` is at the very top level and the environments changed twice in that fun call – Rich Scriven Jan 05 '15 at 06:02
  • I thought 0.4.0 version was already on CRAN http://rpubs.com/hadley/52611 – Khashaa Jan 05 '15 at 06:05
  • I can't tell you why exactly it failed but here's a similar function that works with dplyr 0.3.0.2: `f <- function(x, y, bad) {data_frame(x, y) %>% arrange(x) %>% group_by(x) %>% summarise(sum.bad = if (bad) sum(y) else sum(!y))}` – talat Jan 05 '15 at 07:05

1 Answer


Seems like this is a problem with how dplyr sets up the environment for the data.table call. The problem appears in the dplyr:::summarise_.grouped_dt function. It currently looks like

function (.data, ..., .dots) 
{
    dots <- lazyeval::all_dots(.dots, ..., all_named = TRUE)
    for (i in seq_along(dots)) {
        if (identical(dots[[i]]$expr, quote(n()))) {
            dots[[i]]$expr <- quote(.N)
        }
    }
    list_call <- lazyeval::make_call(quote(list), dots)
    call <- substitute(dt[, list_call, by = vars], list(list_call = list_call$expr))
    env <- dt_env(.data, parent.frame())
    out <- eval(call, env)
    grouped_dt(out, drop_last(groups(.data)), copy = FALSE)
}
<environment: namespace:dplyr>

and if we debug that function and look at the trace when it's called, we see

where 1: summarise_.grouped_dt(.data, .dots = lazyeval::lazy_dots(...))
where 2: summarise_(.data, .dots = lazyeval::lazy_dots(...))
where 3: summarise(., sum.bad = sum(y == bad))
where 4: function_list[[k]](value)
where 5: withVisible(function_list[[k]](value))
where 6: freduce(value, `_function_list`)
where 7: `_fseq`(`_lhs`)
where 8: eval(expr, envir, enclos)
where 9: eval(quote(`_fseq`(`_lhs`)), env, env)
where 10: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
where 11 at #3: z %>% group_by(x) %>% summarise(sum.bad = sum(y == bad))
where 12: f(rnorm(100), rnorm(100) < 0, bad = FALSE)

So the important line is the

env <- dt_env(.data, parent.frame())

one. This sets up the environment chain that specifies where to look up all the variables in the call. Here it just uses parent.frame(), which looks to where the function was called from, but since you actually jump through a few hoops to get to this function from your summarise() call inside f(), this doesn't seem to be the right parent frame. If, instead, you run

env <- dt_env(.data, parent.frame(2))

in debug mode, that seems to actually get at the correct parent frame. So I think the problem is the jump from summarise() to summarise_(), because this

ff <- function(x, y, bad) {
  z <- data.table(x, y, key = "x")
  z2 <- z %>% group_by(x) %>% summarise_(.dots=list(sum.bad = quote(sum(y == bad))))
  z2
}

ff(rnorm(100), rnorm(100) < 0, bad = FALSE) 

seems to work. So it's really dplyr that needs to set up the correct environment. The tricky part is that the right environment appears to be different depending on whether you call summarise() or summarise_() directly. Perhaps summarise() could change the environment when it calls summarise_() so that it has the same parent.frame, via eval(). But I'd probably file this as a bug report and let Hadley decide how to fix it. Something like

summarise <- function(.data, ...) {
  # capture the original summarise(.data, name = expr, ...) call
  call <- match.call()
  # repack the ... expressions into a single .dots argument
  call <- as.call(c(as.list(call)[1:2], list(.dots = as.list(call)[-(1:2)])))
  call[[1]] <- quote(summarise_)
  # evaluate in the caller's frame so summarise_ sees the same parent.frame
  eval(call, envir = parent.frame())
}

would be a "traditional" way to do it. Not sure if the lazyeval package has nicer ways to do this or not.
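
To see the frame issue outside of dplyr and data.table entirely, here is a toy sketch (inner, wrapper and user_fun are made-up names for illustration, not dplyr code): a helper that evaluates an expression in parent.frame() loses sight of a variable as soon as one extra wrapper frame sits between it and the user's function, and walking one frame higher fixes it.

# toy sketch only -- not dplyr's actual code
inner <- function(expr) {
  # evaluate the captured expression in the immediate caller's frame
  eval(expr, envir = parent.frame())
}
wrapper <- function(expr) {
  # one extra frame between the user and inner(), like summarise() -> summarise_()
  inner(expr)
}
user_fun <- function(bad) wrapper(quote(bad + 1))
user_fun(1)
# errors with "object 'bad' not found" (unless some 'bad' exists globally):
# parent.frame() inside inner() is wrapper()'s frame, not user_fun()'s

inner2 <- function(expr) {
  # skip the wrapper's frame, analogous to parent.frame(2) above
  eval(expr, envir = parent.frame(2))
}
wrapper2 <- function(expr) inner2(expr)
user_fun2 <- function(bad) wrapper2(quote(bad + 1))
user_fun2(1)  # returns 2

This is the kind of frame counting that lazyeval's approach of capturing an expression together with its environment is meant to avoid.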

Tested with data.table_1.9.2 and dplyr_0.3.0.2

MrFlick
  • Depends what you mean by "nicer", but `<<-` within the function will put the variable into the scope it'll be looked up in. Make sure you only use one named temporary global variable per function (i.e. don't call them all `tempvar`, or you could run into some problems that are difficult to debug) – hedgedandlevered Aug 12 '16 at 20:43