I just ran into some weird behavior of dplyr where summarize kept referring to objects from a previous group.

Here is a simple reproducible example to illustrate the surprising behavior:

library(dplyr, warn.conflicts = FALSE)
tibble(x = rep(letters[1:3], times = 4),
       y = rnorm(12)) %>%
  group_by(x) %>%
  summarize(z1 = sum(y),
            z2 = {
              attr(y, "test") <- "test"
              sum(y)
            })
#> # A tibble: 3 × 3
#>   x         z1    z2
#>   <chr>  <dbl> <dbl>
#> 1 a      0.602 0.602
#> 2 b      1.22  0.602
#> 3 c     -0.310 0.602

Created on 2022-10-31 by the reprex package (v2.0.1)

I expected z1 and z2 to be identical. I don't understand why setting an attribute on the vector y means that in later iterations the reference to the "correct" elements of y is shadowed.

The problem can easily be fixed by using sum(.data$y) in the last line, but I would like to understand the scoping rules within the non-standard evaluation of summarize. Any pointers to helpful documentation, or explanations of why the current behavior makes sense in the tidyverse non-standard evaluation framework, would be appreciated.


I am using R 4.1.1 with dplyr 1.0.7.

const-ae
  • I agree; my current theory is that setting the attribute somehow means that a new copy of `y` is created in whatever environment `dplyr` is using to evaluate the content of `summarize`. I hope to understand better how these environments play together to avoid subtle bugs in the future. – const-ae Oct 31 '22 at 10:18
  • Really interesting - I found that `this_y <<- y` gives identical `y`s regardless of whether y is assigned before or after the adjustment of attributes – Captain Hat Oct 31 '22 at 10:19
  • NB `magrittr::set_attr()` behaves as one would expect in this context. – Captain Hat Oct 31 '22 at 10:20
  • It's definitely related to the fact `y` is a grouping variable - assigning `y` to something else and changing its attributes yields the expected result. – Captain Hat Oct 31 '22 at 10:24
  • I've deleted my comment as my initial suspicion was wrong and Allan Cameron has very nicely demonstrated what is happening. Only thing I would add is the best way to avoid bugs like this is not to assign to a column of the entire dataframe in curly braces in a pipe after applying grouping - I think on its own that is a code smell... – SamR Oct 31 '22 at 10:25

1 Answer

This is a scoping problem. A replacement call such as attr(y, "test") <- "test" is evaluated by R as an assignment to y, roughly y <- `attr<-`(y, "test", value = "test"). If you assign to the variable y inside summarize, the first group's chunk of your data's y column is copied into a local variable called y that is distinct from the y in your data frame. Because it is a local variable, name lookup finds it before the y in the passed data frame. Since the same environment is used for subsequent groups' calculations inside summarize, this local variable persists for every group.
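As a rough illustration of that explanation, here is a base R sketch that does not use dplyr at all. It assumes, as a simplification, that the column is exposed through an active binding in a parent environment and that every group's expression is evaluated in one shared child environment. The names groups, current, bottom and mask are made up for this example and are not dplyr internals.

```r
# Hypothetical stand-in for grouped data: one chunk of `y` per group.
groups <- list(a = c(1, 2), b = c(10, 20), c = c(100, 200))
current <- "a"

# `bottom` exposes the current group's chunk of `y` via an active binding.
bottom <- new.env()
makeActiveBinding("y", function() groups[[current]], bottom)

# The expression from the question.
expr <- quote({
  attr(y, "test") <- "test"  # creates a local `y` where the code is evaluated
  sum(y)
})

# Variant 1: one shared evaluation environment for all groups.
# The local `y` created for group "a" shadows the binding for "b" and "c".
mask <- new.env(parent = bottom)
shared <- numeric(0)
for (g in names(groups)) {
  current <- g
  shared[g] <- eval(expr, mask)
}

# Variant 2: a fresh evaluation environment per group, so nothing persists.
fresh <- numeric(0)
for (g in names(groups)) {
  current <- g
  fresh[g] <- eval(expr, new.env(parent = bottom))
}

print(shared)  # every group reports the first group's sum
print(fresh)   # every group reports its own sum
```

In this sketch the shared environment reproduces the repeated value from the question, while the per-group environment does not. Whether dplyr could or should evaluate each group in a fresh environment is a separate design question that this toy model does not answer.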

We can see the same effect with a plain assignment to y:

library(dplyr, warn.conflicts = FALSE)

set.seed(1)

tibble(x = rep(letters[1:3], times = 4),
       y = rnorm(12)) %>%
  group_by(x) %>% 
  summarize(z1 = sum(y),
            z2 = {
              y <- y
              sum(y)
            }) 
#> # A tibble: 3 x 3
#>   x         z1    z2
#>   <chr>  <dbl> <dbl>
#> 1 a      1.15   1.15
#> 2 b      2.76   1.15
#> 3 c     -0.690  1.15

As long as we remove the local copy of the y variable from the local frame, this doesn't happen:

library(dplyr, warn.conflicts = FALSE)

set.seed(1)

tibble(x = rep(letters[1:3], times = 4),
       y = rnorm(12)) %>%
  group_by(x) %>% 
  summarize(z1 = sum(y),
            z2 = {
              attr(y, "test") <- "test"
              x <- sum(y)
              rm(y)
              x
            }) 
#> # A tibble: 3 x 3
#>   x         z1     z2
#>   <chr>  <dbl>  <dbl>
#> 1 a      1.15   1.15 
#> 2 b      2.76   2.76 
#> 3 c     -0.690 -0.690

Or better still, don't write to a local variable with the same name as a variable in your data frame:

library(dplyr, warn.conflicts = FALSE)

set.seed(1)

tibble(x = rep(letters[1:3], times = 4),
       y = rnorm(12)) %>%
  group_by(x) %>% 
  summarize(z1 = sum(y),
            z2 = {
              new_y <- y
              attr(new_y, "test") <- "test"
              sum(new_y)
            }) 
#> # A tibble: 3 x 3
#>   x         z1     z2
#>   <chr>  <dbl>  <dbl>
#> 1 a      1.15   1.15 
#> 2 b      2.76   2.76 
#> 3 c     -0.690 -0.690

Created on 2022-10-31 with reprex v2.0.2

Allan Cameron
  • Thanks, the point that `y` becomes a local variable makes sense. I am just surprised that `dplyr` doesn't automatically clean up the environment / uses the expected scoping within `summarize`. – const-ae Oct 31 '22 at 10:25
  • To me that is as if `f <- function(x, y){ g <- function(y){ print(attr(x, "test")); attr(x, "test") <- "test"; sum(y) }; g(y) + g(2 * y) }; f("hello", 1:10)` would print "NULL" and then "test" – const-ae Oct 31 '22 at 10:27
  • @const-ae That would be a really bad way to work things. If you have a variable in your workspace called `x`, then running _any_ function that uses a temporary variable called `x` (as many do) would change the x in your workspace, whether you wanted it to or not. You would need to make sure that any function you ever ran did not use a variable name that existed in your workspace, which would be error-prone and impractical. R is a functional language, and users don't expect functions to have any side effects like this. – Allan Cameron Oct 31 '22 at 10:33
  • But that's my point. I am glad that calling `f()` prints `NULL` `NULL`. I find it odd that in dplyr I apparently need to make sure I am not accidentally reusing variable names. – const-ae Oct 31 '22 at 10:37
  • @const-ae you can think of the curly brackets inside summarize in the same way you would think of curly brackets surrounding a loop, rather than the curly brackets surrounding a function. A function has its own evaluation frame, and is sandboxed from the calling frame, though it has access to it. A loop runs the same code multiple times in the same evaluation frame (so can over-write variables), and that is what your code inside `summarize` is doing here. It's as though you believe it should be behaving as an anonymous function with access to `y`, but that's not the case. – Allan Cameron Oct 31 '22 at 10:43
  • Thanks, the for loop is a very helpful example. I was indeed thinking more in terms of anonymous inner functions that are repeatedly called. Do you know why it was implemented this way? Is there any documentation on these scoping details? – const-ae Oct 31 '22 at 10:48
  • @const-ae I don't know of any documentation regarding scoping inside `summarise`, but the arguments to `summarise` are captured as _quosures_, that is, expressions with associated environments. The work is largely done inside `dplyr:::summarise_cols`, where, if you trace the logic, you will see that your code is indeed run in a loop within the same environment. – Allan Cameron Oct 31 '22 at 11:02
  • @const-ae as for 'why?', I don't know that for sure, but to me as a regular R user, it is the behaviour I would expect. One could run an anonymous function if one wanted to avoid name clashes, i.e. `z2 = (function(){attr(y, 'test') <- 'test'; sum(y)})()`, and again this gives the behaviour I would expect in R. – Allan Cameron Oct 31 '22 at 11:05
  • @Allan, thank you very much for taking the time to explain these questions to me. I have accepted your answer :) – const-ae Oct 31 '22 at 12:04
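
The loop-versus-function analogy from the comments can be sketched in plain base R. This is only an illustration of the two mental models, not of dplyr's implementation; the names seen_in_loop, peek_and_tag and seen_in_calls are invented for the example.

```r
# Loop: the body runs repeatedly in the same frame, so the attribute set in
# the first pass is still attached to `y` in the second pass.
y <- 1:3
seen_in_loop <- character(0)
for (i in 1:2) {
  seen_in_loop[i] <- if (is.null(attr(y, "test"))) "NULL" else attr(y, "test")
  attr(y, "test") <- "test"
}

# Function: each call gets a fresh frame. The replacement call modifies a
# local copy of `y2`, which is discarded when the call returns.
y2 <- 1:3
peek_and_tag <- function() {
  seen <- if (is.null(attr(y2, "test"))) "NULL" else attr(y2, "test")
  attr(y2, "test") <- "test"
  seen
}
seen_in_calls <- c(peek_and_tag(), peek_and_tag())

print(seen_in_loop)   # "NULL" "test"
print(seen_in_calls)  # "NULL" "NULL"
```

The loop result matches what the question observed inside summarize, and the function result matches what the asker expected, which is why wrapping the expression in an anonymous function avoids the name clash.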