What does it mean by formulas and closures being able to "capture the enclosing environment" in R?

Question

Quoting from Hadley Wickham's Advanced R text-book,

f1 <- function() {
  x <- 1:1e6
  10
}

pryr::mem_change(x <- f1())
#> 1.43 kB
pryr::object_size(x)
#> 48 B

f2 <- function() {
  x <- 1:1e6
  a ~ b
}
pryr::mem_change(y <- f2())
#> 4 MB
pryr::object_size(y)
#> 4 MB

f3 <- function() {
  x <- 1:1e6
  function() 10
}
pryr::mem_change(z <- f3())
#> 4 MB
pryr::object_size(z)
#> 4.01 MB

In f1(), 1:1e6 is only referenced inside the function, so when the function completes the memory is returned and the net memory change is 0. f2() and f3() both return objects that capture environments, so that x is not freed when the function completes.

I am vaguely able to understand the concept of a closure capturing the environment. But how is the ~ doing so? And also, what exactly does it mean by "capturing the environment" in this context?

score 2 · Accepted Answer · answered Jun 05 '15 at 06:09

The tilde operator is actually an infix function (as are the usual mathematical operators), so it can record the environment it is called from when it is invoked. The formula it returns carries that environment with it:

> is.function(`~`)
[1] TRUE
> myform <- a ~ b
> length(myform)
[1] 3
> myform[[1]]
`~`
> myform[[2]]
a
> myform[[3]]
b

> environment(myform)
<environment: R_GlobalEnv>
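
The same thing can be checked for a formula created inside a function, which is the f2() case from the question. This is only a minimal sketch; make_formula is a hypothetical name used for illustration:

```r
# make_formula is a hypothetical helper, mirroring f2() from the question
make_formula <- function() {
  x <- 1:10   # local variable in this call's evaluation environment
  a ~ b       # the formula records that evaluation environment
}

fm  <- make_formula()
env <- environment(fm)

ls(env)                # "x": the local variable is still there
get("x", envir = env)  # 1:10, reachable through the formula
```

Because the returned formula holds a reference to the environment, and the environment holds x, x cannot be garbage collected while fm exists.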

score 1 · Answer 2 · edited May 23 '17 at 10:27

In R, an environment is simply a data object that contains other data in the form of key/value pairs; basically it's a hash table.

There are many built-in environments in R, including the global environment globalenv(), the empty environment emptyenv(), the public and private environments of packages, the Autoload environment, and others. Additionally, there are environments that are generated on-the-fly for function evaluations. In other words, when you evaluate a function, an environment is generated dynamically for the purpose of holding all local variables created during that function evaluation. This environment is generally referred to as an evaluation environment, and it is specific to that particular evaluation of that particular function.
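
A quick way to see that each evaluation gets its own environment is to return it from the function. This is a small sketch; show_env is a made-up name:

```r
# show_env is a hypothetical helper that returns its own evaluation environment
show_env <- function() environment()

e1 <- show_env()
e2 <- show_env()

is.environment(e1)  # TRUE
identical(e1, e2)   # FALSE: two calls, two distinct evaluation environments
```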

It is also possible to create your own environments via new.env(), which are obviously not built-in environments and are not associated with any function evaluation; they are independent, user-defined environments.
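
As a rough illustration of the key/value nature of environments, here is a sketch using a user-defined environment. The names e and set_a are arbitrary and only for illustration:

```r
e <- new.env()

assign("a", 1, envir = e)  # one way to create a key/value pair
e$b <- "two"               # another way

ls(e)                                     # "a" "b"
exists("a", envir = e, inherits = FALSE)  # TRUE

# Environments are not copied when passed around, so a function can
# modify the caller's environment in place.
set_a <- function(env) env$a <- 99
set_a(e)
e$a                                       # 99
```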

When a new function is defined in the course of a function evaluation, it captures the evaluation environment inside which that definition took place. In this context, and with respect to that dynamically-defined function, this environment can be referred to as the closure environment or the enclosing environment of that function. The dynamically-defined function itself can be referred to as a closure.
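
A minimal sketch of a function capturing its defining evaluation environment follows; make_adder and add2 are hypothetical names:

```r
# make_adder is a hypothetical function factory
make_adder <- function(n) {
  function(x) x + n   # defined during the evaluation of make_adder()
}

add2 <- make_adder(2)
add2(5)                    # 7: n is found in the captured environment

ls(environment(add2))      # "n"
environment(add2)$n        # 2
```

The evaluation environment of make_adder(2) would normally be discarded after the call returns, but it survives here because add2 references it.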

The process is the same with formulae. When a formula is defined in the course of a function evaluation, it captures the evaluation environment inside which that definition took place, and you can use the same terminology here, although perhaps closure function and closure formula would be useful specifics to disambiguate the unqualified term closure if the context doesn't make it clear.
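
One practical reason a formula carries its environment is that modelling functions look up the formula's variables there when no data argument supplies them. The following is a sketch under that assumption; make_model_formula is a hypothetical name:

```r
# make_model_formula is a hypothetical helper; x and y exist only inside it
make_model_formula <- function() {
  x <- c(1, 2, 3)
  y <- c(2, 4, 6)
  y ~ x
}

fm <- make_model_formula()

# No `data` argument is given, so the variables are looked up
# in environment(fm), i.e. the captured evaluation environment.
mf <- stats::model.frame(fm)
mf$x   # 1 2 3
mf$y   # 2 4 6
```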

If a function or formula is defined in global scope, it still closures around an environment; that environment just happens to be the global environment. This is unusual among programming languages that support closures; for example, in Perl, if an anonymous subroutine does not reference any non-local variables in its body, then it does not become a closure; see http://apache.perl.org/docs/general/perl_reference/perl_reference.html#Understanding_Closures____the_Easy_Way. Because of this, technically every function and formula in R is a closure, because they all closure around an environment. Thus it is admittedly redundant to use the terms closure function and closure formula when talking about R, but it can still be useful to highlight the "closureness" of these data objects. (Actually there is one semi-exception to this rule: .Primitive() functions such as if, while, return, function, <-, [, $, @, *, &, :, sum(), etc. do not have enclosing environments, but that is not really an exception, as their code is implemented in C and is compiled into the R executable; thus, they cannot be closures anyway.)
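
The difference between primitives and ordinary functions can be inspected directly. A short sketch, where plain_fn is an arbitrary name:

```r
plain_fn <- function() NULL

typeof(plain_fn)      # "closure"
typeof(sum)           # "builtin"
typeof(`if`)          # "special"

is.primitive(sum)     # TRUE
environment(sum)      # NULL: primitives have no enclosing environment
is.environment(environment(plain_fn))  # TRUE
```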

There is one notable difference, however, between closure functions and closure formulae: For closure formulae, the closure environment is stored in an attribute named .Environment on the formula object, accessible via attr()/attributes(). For closure functions, the closure environment is not an attribute, but rather is one of the three fundamental properties of the function, the other two being the body and the parameters. These three properties of functions are accessed via the (ultimately .Internal()) functions environment(), body(), and formals(). It should also be mentioned that environment() can be used to access the closure environment of a closure formula, even though it is also accessible as a normal attribute. (And, when invoked with no arguments, it can also be used to return the current evaluation environment in function scope, or the global environment at global scope; speaking generally, environment() is a very versatile function!)
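
Here is a small sketch of the accessors just described, applied to a formula and a function created in the same evaluation; make_both and objs are hypothetical names:

```r
# make_both is a hypothetical helper returning a formula and a function
make_both <- function() {
  list(fm = y ~ x, fn = function(a, b = 2) a + b)
}
objs <- make_both()

# Formula: the environment is an ordinary attribute
".Environment" %in% names(attributes(objs$fm))                   # TRUE
identical(attr(objs$fm, ".Environment"), environment(objs$fm))   # TRUE

# Function: the environment is not an attribute
is.null(attr(objs$fn, ".Environment"))                           # TRUE
names(formals(objs$fn))                                          # "a" "b"
body(objs$fn)                                                    # a + b

# Both were defined in the same evaluation, so they share one environment
identical(environment(objs$fn), environment(objs$fm))            # TRUE
```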

Lastly, an important concept is the parent environment chain. Not only do functions and formulae reference environments, but environments reference environments. Every environment references exactly one other environment (the only exception being the empty environment, which has none), just as every closure references exactly one environment. In the context of an environment referencing another environment, the standard terminology is to say that the referenced environment is the parent environment of the referencing environment. You could therefore theoretically say that the latter is a child environment of the former, although there can be multiple children, and I haven't seen this term used anywhere; no one ever navigates the chain downwards. You can use parent.env() to get hold of the parent environment of a given environment.

Which environment is used as the parent of a given environment depends on which environment we're talking about, but the most important type of environment here is an evaluation environment. As explained earlier, when a function is executed, an evaluation environment is created for that particular evaluation of that particular function. At that time, the parent environment of the new evaluation environment is hooked up to the closure environment of the function itself.
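
The following sketch checks that the parent of an evaluation environment is the function's closure environment, regardless of where the function is called from. All names (make_probe, call_from_elsewhere, p) are hypothetical:

```r
# make_probe returns a function plus the environment it was defined in
make_probe <- function() {
  defined_in <- environment()
  probe <- function() parent.env(environment())  # parent of probe's evaluation env
  list(defined_in = defined_in, probe = probe)
}
p <- make_probe()

# Call the probe from a completely different function
call_from_elsewhere <- function(f) f()

identical(call_from_elsewhere(p$probe), p$defined_in)  # TRUE
identical(environment(p$probe), p$defined_in)          # TRUE
```

The caller's environment plays no role here; only the place of definition matters.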

The ability of environments to reference environments obviously introduces the possibility for multiple environments to be hooked up to each other in a chain, and this is exactly what happens in R. Actually, it allows for the formation of a more complicated structure, a directed graph. Technically, you could create any number of directed graphs of environments you want, by creating environments with new.env() and assigning their parent environments appropriately. You can even create circular graphs. Out of curiosity to see what would happen, I just created a "circle of environments", hooked up a function's closure environment to one of the environments in the circle, and executed the function, whose body tried to use the superassignment operator <<- to commence a target search for the lvalue (more on this later). It caused an infinite loop and busted my R session. Don't do this!

But aside from user-defined environments and graphs, there is a fundamental structure of environments in R that is built into the core of the language and is repeatedly used during normal execution. Since this structure is nearly always followed along a line, it makes sense to just refer to it as a chain, and ignore the more complicated structure. Below I attempt to explain exactly which environments form this important parent environment chain.

The global environment sits at an important nexus in the chain; you might say it sits at the "center" of the chain. Behind the global environment are the public environments of all packages you have attached in your session, plus the Autoload environment just in front of the base environment, and finally the chain is terminated by the empty environment. You can examine this segment of the chain with search() (and, as I'll demonstrate later, parent.env()):

search();
## [1] ".GlobalEnv"        "package:stats"     "package:graphics"  "package:grDevices" "package:utils"     "package:datasets"  "package:methods"   "Autoloads"         "package:base"

In front of the global environment is nothing by default. But when you start to dynamically define functions/formulae inside the evaluation environments of other functions, then the evaluation environments of those enclosing functions will be closured by the nested functions/formulae, and those closured evaluation environments will reference their enclosing environments, or the global environment if the evaluation is of a function that was defined at global scope. This technically forms a tree of environments, but all roads lead back to the global environment, which then leads to the search path as shown above.

In any lexical scope, any attempt to reference a variable as an rvalue, and any attempt to use the superassignment <<- operator to assign a variable as an lvalue, will initiate what I like to call a "target search" through the chain of environments to look for the first environment that has a key that matches that variable's name. For an rvalue usage, the current value of that variable is substituted, and for an lvalue usage, the value of the variable is replaced with the return value of the RHS of the assignment.

If undefined, the target search for an rvalue will move all the way back through the closure side of the chain, through the global environment, and through the entire search path until it hits the empty environment, at which point you'd get the classic Error: object 'whatever' not found error message. If undefined, the target search for an lvalue will also move all the way back through the closure side of the chain, through the global environment, and through the entire search path, at which point a new variable will be defined in the global environment with the return value of the RHS as its value.

I find it somewhat strange that a superassignment target search does indeed pass through the entire search path, despite the fact that the bindings in attached package environments are generally locked. This causes superassignments to variable names that clash with existing variables in any public package environment to fail. For example, if you run c <<- 3;, you get Error: cannot change value of locked binding for 'c'. Now, I have verified that it is possible to interpose your own user-defined environment into the search path (e.g. e1 <- new.env(); parent.env(e1) <- baseenv(); parent.env(.AutoloadEnv) <- e1; e1$v1 <- 3;), in which case superassignments to variable names that are already defined in that environment succeed (e.g. v1 <<- 4;), but, generally speaking, no one ever does this (nor should they), so it has no practical use. Perhaps this is another good reason to avoid using the superassignment operator (the first reason being that it creates what is usually an unnecessary side effect from a function call, opposing the functional paradigm of program design).

Finally, for the sake of completeness, the local assignment <- operator never initiates a target search; it always assigns to the corresponding key/value pair in the immediate evaluation environment. The rightwards assignment operators (->> and ->) behave just as their leftwards friends do, only with the LHS and RHS meanings reversed.
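
The contrast between <- and <<- can be seen in a small sketch. The names make_scope_demo, local_add, super_add, and current are all hypothetical:

```r
make_scope_demo <- function() {
  total <- 0
  # `<-` creates a new local `total` in local_add's own evaluation environment
  local_add <- function(n) { total <- total + n; total }
  # `<<-` searches the enclosing environments and updates the `total` found there
  super_add <- function(n) { total <<- total + n; total }
  list(local_add = local_add,
       super_add = super_add,
       current   = function() total)
}

d <- make_scope_demo()

d$local_add(5)  # 5, but only inside that one call
d$current()     # 0: the captured `total` is unchanged
d$super_add(5)  # 5
d$current()     # 5: the captured `total` was modified
```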

I've written a piece of code to try to demonstrate closures, along with the important closure-related functions environment(), globalenv(), and parent.env(), along with the assignment <- and superassignment <<- operators. I originally wrote it for another answer, but here I'll extend it to also demonstrate closure formulae. In the following code, every assignment whose RHS is an environment assigns the actual environment that contains the variable being assigned to. I also assign three formulae (for these the RHS cannot be environments, because they have to be formulae!) and reveal their closure environments at the end of the code:

oldGlobal <- environment(); ## environment() is same as globalenv() in global scope
f0 <- ~.;
(function() {
    newLocal1 <- environment(); ## creates a new local variable in this function evaluation's evaluation environment
    print(newLocal1); ## <environment: 0x6008e2fe8> (different for every evaluation)
    oldGlobal <<- parent.env(environment()); ## target search hits oldGlobal in closure environment; RHS is same as globalenv()
    newGlobal1 <<- globalenv(); ## target search fails; creates a new variable in the global environment
    f1 <<- ~.;
    (function() {
        newLocal2 <- environment(); ## creates a new local variable in this function evaluation's evaluation environment
        print(newLocal2); ## <environment: 0x600874968> (different for every evaluation)
        newLocal1 <<- parent.env(environment()); ## target search hits the existing newLocal1 in closure environment
        print(newLocal1); ## same value that was already in newLocal1
        oldGlobal <<- parent.env(parent.env(environment())); ## target search hits oldGlobal two closure environments up in the chain; RHS is same as globalenv()
        newGlobal2 <<- globalenv(); ## target search fails; creates a new variable in the global environment
        f2 <<- ~.;
    })();
})();
oldGlobal; ## <environment: R_GlobalEnv>
newGlobal1; ## <environment: R_GlobalEnv>
newGlobal2; ## <environment: R_GlobalEnv>
environment(f0); ## <environment: R_GlobalEnv>
environment(f1); ## <environment: 0x6008e2fe8>
environment(f2); ## <environment: 0x600874968>

As you can see there are three environments that are relevant here: (1) the global environment R_GlobalEnv, (2) the first-level evaluation environment 0x6008e2fe8, whose parent environment is the global environment, and (3) the second-level evaluation environment 0x600874968, whose parent environment is the first-level evaluation environment. Below I summarize which variables refer to which environments:

oldGlobal    assigned to the global environment R_GlobalEnv
newGlobal1   assigned to the global environment R_GlobalEnv
newGlobal2   assigned to the global environment R_GlobalEnv
f0           closured around the global environment R_GlobalEnv
newLocal1    assigned to the first-level evaluation environment 0x6008e2fe8
f1           closured around the first-level evaluation environment 0x6008e2fe8
newLocal2    assigned to the second-level evaluation environment 0x600874968
f2           closured around the second-level evaluation environment 0x600874968

Finally, it is instructive to follow the parent environment chain all the way back from the second-level evaluation environment to the empty environment, illuminating exactly how the target search is done. We can utilize f2 for this, as it closured around the second-level evaluation environment:

environment(f2);
## <environment: 0x600874968>
parent.env(environment(f2));
## <environment: 0x6008e2fe8>
parent.env(parent.env(environment(f2)));
## <environment: R_GlobalEnv>
parent.env(parent.env(parent.env(environment(f2))));
## <environment: package:stats>
## attr(,"name")
## [1] "package:stats"
## attr(,"path")
## [1] "/usr/lib/R/library/stats"
Reduce(function(a,b) b(a),c(environment(f2),replicate(4,parent.env)));
## <environment: package:graphics>
## attr(,"name")
## [1] "package:graphics"
## attr(,"path")
## [1] "/usr/lib/R/library/graphics"
Reduce(function(a,b) b(a),c(environment(f2),replicate(5,parent.env)));
## <environment: package:grDevices>
## attr(,"name")
## [1] "package:grDevices"
## attr(,"path")
## [1] "/usr/lib/R/library/grDevices"
Reduce(function(a,b) b(a),c(environment(f2),replicate(6,parent.env)));
## <environment: package:utils>
## attr(,"name")
## [1] "package:utils"
## attr(,"path")
## [1] "/usr/lib/R/library/utils"
Reduce(function(a,b) b(a),c(environment(f2),replicate(7,parent.env)));
## <environment: package:datasets>
## attr(,"name")
## [1] "package:datasets"
## attr(,"path")
## [1] "/usr/lib/R/library/datasets"
Reduce(function(a,b) b(a),c(environment(f2),replicate(8,parent.env)));
## <environment: package:methods>
## attr(,"name")
## [1] "package:methods"
## attr(,"path")
## [1] "/usr/lib/R/library/methods"
Reduce(function(a,b) b(a),c(environment(f2),replicate(9,parent.env)));
## <environment: 0x60019a7f8>
## attr(,"name")
## [1] "Autoloads"
Reduce(function(a,b) b(a),c(environment(f2),replicate(10,parent.env)));
## <environment: base>
Reduce(function(a,b) b(a),c(environment(f2),replicate(11,parent.env)));
## <environment: R_EmptyEnv>
Reduce(function(a,b) b(a),c(environment(f2),replicate(12,parent.env)));
## Error in b(a) : the empty environment has no parent
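
The repeated Reduce() calls above can also be written as a loop that stops at the empty environment. This is just a sketch; env_chain is a hypothetical helper, and unnamed environments (such as evaluation environments) show up as empty strings:

```r
# env_chain is a hypothetical helper: it collects the names of all
# environments from `env` up to and including the empty environment
env_chain <- function(env) {
  names_seen <- character(0)
  repeat {
    names_seen <- c(names_seen, environmentName(env))
    if (identical(env, emptyenv())) break  # the empty environment has no parent
    env <- parent.env(env)
  }
  names_seen
}

chain <- env_chain(globalenv())
chain[1]               # "R_GlobalEnv"
chain[length(chain)]   # "R_EmptyEnv"
```

Passing environment(f2) instead of globalenv() would prepend the two evaluation environments to the same chain.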
