
I want to make many plots using multiple pairs of variables in a dataframe, all with the same x. I store the plots in a named list. For simplicity, below is an example with only 1 variable in each plot.

Key to this function is a select() call that is clearly not necessary here but is with my actual data.

The body of the function works fine on each variable, but when I loop through a list of variables, the last one in the list always produces

Error in get(ll): object 'd' not found.

(or whatever the last variable is, if not 'd'). Replacing data <- df %>% select(x, ll) with data <- df avoids the error.

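## load packages used below (%>% and select() from dplyr, plotting from ggplot2)
library(dplyr)
library(ggplot2)

## make data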
df2 <- data.frame(x = 1:10,
                  a = 1:10,
                  b = 2:11,
                  c = 101:110,
                  d = 10*(1:10))

## make function
testfun <- function(df = df2, vars = letters[1:4]){
  ## initialize list to store plots
  plotlist <- list()
  
  for (ll in vars){
    ## subset data
    data <- df %>% select(x, ll) ## comment out select() to get working function
    # print(data) ## uncomment to check that dataframe subset works correctly
    
    ## plot variable vs. x
    p <- ggplot(data,
           aes(x = x, y = get(ll))) +
      geom_point() +
      ylab(ll)
    
    ## add plot to named list
    plotlist[[ll]] <- p
    # print(p) ## uncomment to see that each plot is being made
  }
  return(plotlist) ## unnecessary, being explicit for troubleshooting
}

## use function
pl <- testfun(df2)
## error ?
pl

I have a workaround that avoids select() by renaming variables in my actual dataframe, but I am curious why this does not work. Any ideas?

Thomas
nefosl
  • Use `dplyr::select`? Haven't run the code, just think it's calling the wrong `select`. Another issue may be not using `NSE`. – NelsonGon Jan 15 '22 at 15:26
  • I still get the error using `dplyr::select` (a reasonable suggestion, as plotly also has a select function), but @NelsonGon's suggestion below worked – nefosl Jan 15 '22 at 15:51

3 Answers


The issue is that we cannot use get to access dplyr/tidyverse data in a "programming" paradigm. Instead, we should use non-standard evaluation to access the data. I offer a simplified function below (originally I thought it was a function-masking issue, as I had only skimmed the question).

testfun <- function(df = df2, vars = letters[1:4]){
  lapply(vars, function(y) {
    ggplot(df,
           aes(x = x, y = .data[[y]])) +
      geom_point() +
      ylab(y)
  })
}

Calling

plots <- testfun(df2)
plots[[1]]

EDIT

Since OP would like to know what the issue is, I have used a traditional loop as requested

testfun2 <- function(df = df2, vars = letters[1:4]){
  ## initialize list to store plots
  plotlist <- list()
  
  for (ll in vars){
    ## subset data
    d_t <- df %>% select(x, ll)
    # print(d_t) ## uncomment to check that dataframe subset works correctly
    ## plot variable vs. x
    p <- ggplot(d_t,
                aes(x = x, y = .data[[ll]])) +
      geom_point() +
      ylab(ll)
    ## add plot to named list
    plotlist[[ll]] <- p
    # print(p) ## uncomment to see that each plot is being made
  }
  plotlist

}
pl <- testfun2(df2)
pl[[1]]

The reason get does not work is that we need to use non-standard evaluation as the docs state. Related questions on using get may be useful.

[Figure: first plot]

NelsonGon
  • I added the `dplyr::select` call to this `lapply` and all works well, both in example and with my actual data. Thanks. (But I'm still not totally sure why besides details of NSE that I don't yet understand) – nefosl Jan 15 '22 at 15:54
  • What does the error say? It's just masking. You loaded `dplyr` before some other package that has a function named `select`. You can use [conflicted](https://cran.r-project.org/web/packages/conflicted/index.html) or inspect the traceback to see where the function is coming from that errors. I personally prefer `getAnywhere` and `methods`. – NelsonGon Jan 15 '22 at 15:58
  • What is the difference between how select() works and how .data[[y]] evaluates the variable name? – nefosl Jan 15 '22 at 15:59
  • `.data[[y]]` is part of the NSE suite which includes `sym`, `!!` and the newest `{{}}`. In most cases, you cannot access data within dplyr functions inside a function unless you use non-standard evaluation (NSE). `select` is meant for "top level" while `.data` and friends are more "low level" and development focused. – NelsonGon Jan 15 '22 at 16:00
  • Just to be clear, looks like this is not a masking problem: I changed this to dplyr::select in both my original for loop and in your lapply, and lapply works but my loop doesn't – nefosl Jan 15 '22 at 16:01
  • OK, I will add an option that uses a good ol loop. – NelsonGon Jan 15 '22 at 16:02
  • Could you first add what error you get @nefosl? – NelsonGon Jan 15 '22 at 16:07
  • In the OP: >Error in get(ll): object 'd' not found. – nefosl Jan 15 '22 at 16:09
  • Edited the answer, we cannot just use `get`. – NelsonGon Jan 15 '22 at 16:18

get() could work, but not with ll directly. Try y = get(!!ll) or y = {{ll}}.

ggplot (or maybe aes, it's hard to tell) waits to run this code until its plot object is referenced, as the error in the provided code demonstrates. By the time each ggplot evaluates get(ll), the for loop has already finished. So ll evaluates to the last value of the loop variable, "d", for all four ggplots. ll being "d" in the error makes it seem like it's the final ggplot object that fails, but it's actually evaluating the first one that causes this error.

In the body of the loop we'd like a way to evaluate the ll variable and stick that resulting string ("a", "b", "c", or "d") into this code, the rest of which won't run until later. Changing y = get(ll) to y = get(!!ll) is one way to do this: !! performs "surgery" on the unevaluated expression (called a "blueprint for code" in Tidyverse docs) so that the expression passed into ggplot contains a literal string like "a" instead of the variable reference ll.

testfun <- function(df = df2, vars = letters[1:4]){
  plotlist <- list()
  
  for (ll in vars){
    data <- df %>% select(x, ll)
    
    p <- ggplot(data,
                aes(x = x, y = get(!!ll))) +
                geom_point() +
                ylab(ll)
    
    plotlist[[ll]] <- p
  }
  return(plotlist)
}

Read on for explanation and an alternate solution.


The loop problem: late binding

In a given function or in the global scope in R, there's just one variable of any given name. A for (x in xs) loop repeatedly rebinds that variable to a new value. That means that after a for loop has finished, that variable still exists and retains the last value it was assigned. Here's a way this can trip you up:

vars <- c("a", "b", "c", "d")

results <- list()

for (ll in vars){
  message("in for loop, ll: ", ll)
  func <- function () { ll }
  results[[ll]] <- c(ll, func)
}
message("after for loop, ll: ", ll)
# after for loop, now ll is "d"

for (vec in results) {
  message(vec[[1]], " ", vec[[2]]())
}

This outputs

in for loop, ll: a
in for loop, ll: b
in for loop, ll: c
in for loop, ll: d
after for loop, ll: d
a d
b d
c d
d d

Each of the four functions constructed here uses the same outer-scope variable ll which, by the time the functions are actually called after the for loop, is "d". The late binding part is that the value of the variable at function call time (late) is used when looking up its value, not the value of the variable when the function is defined (early).

The NSE problem

The OP isn't creating functions in a loop though, they're calling ggplot. ggplot does something similar to creating a function: it takes some code as an argument that it doesn't evaluate until later. ggplot (or maybe aes) "captures" code from some of its arguments instead of running it. In OP's case, get(ll) isn't evaluated until later.

When this code is evaluated it's in a new context with a "data mask" that allows names of a data frame to be referenced directly. This part is great, it's what we want — this is what makes get("a") work at all. But the fact that the evaluation happens later is a problem for the OP: ll in get(ll) evaluated to "d", like get("d"), because the code is evaluated after the for-loop iteration where ll had the expected value.

Ignoring the data mask part, here's a function called run.later that, like ggplot, doesn't run one of its arguments. When we run that code later, we again find that ll evaluates to "d" for all four of the saved expressions.

vars <- c("a", "b", "c", "d")

unevaluated.exprs <- list();

run.later <- function(name, something) {
  expr <- substitute(something)
  unevaluated.exprs[[name]] <<- c(name, expr)
}

for (ll in vars){
  run.later(ll, ll)
}

for (vec in unevaluated.exprs) {
  message(c(vec[[1]], " ", eval(vec[[2]])))
}

prints

a d
b d
c d
d d

That's the ll part of the problem. The rule of thumb from languages like Python of "Don't define functions in a loop (if they reference loop variables)" could be generalized for R to "don't define functions or otherwise write code that won't be immediately evaluated in a loop (if that code references loop variables)."


Fixing the scope problem instead of metaprogramming

The !! solution provided at the top uses metaprogramming to evaluate the ll variable in the loop instead of evaluating it later.

Theoretically, one could instead dynamically create variables in each iteration of a loop, then carefully reference that dynamically created variable name with metaprogramming. But a more elegant way would be to use the same variable name but in different scopes. This is what Nithin's answer does with a function: every function creates a new scope and tada, you can use the same variable name in each. Here's another version of that, closer to OP's code:

testfun <- function(df = df2, vars = letters[1:4]){
  plotlist <- list()

  plot.fn <- function(var) {
      data <- df %>% select(x, var)
      p <- ggplot(data,
          aes(x = x, y = get(var))) +
          geom_point() +
          ylab(var)
      plotlist[[var]] <<- p  ## index by the function's own argument, not the loop variable
  }
  
  for (ll in vars){
    plot.fn(ll)
  }
  return(plotlist)
}

pl <- testfun(df2)
pl

There are 4 distinct variables called var in this code, and each iteration of the loop references a different one.


Prettier metaprogramming

I think (haven't tested) that get(!!ll) is equivalent to {{ll}} here — get() looks up a string as a variable, but that's also what sticking the symbol of the string that ll evaluates to into the expression does. Double curlies seem more common and can roughly be understood as "evaluate the result of this expression as a variable in the other context," or as "template this string into the expression."
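A related spelling that is not in the answer above is !!sym(ll). The sketch below assumes rlang's sym() and the !! injection operator, and reuses the question's df2 with base subsetting in place of select(). sym() turns the string into a symbol and !! splices it into the aesthetic while the loop is still running, so the stored expression names the column itself rather than the loop variable.

```r
library(ggplot2)

# Same example data as in the question
df2 <- data.frame(x = 1:10, a = 1:10, b = 2:11, c = 101:110, d = 10 * (1:10))

plotlist <- list()
for (ll in c("a", "b", "c", "d")) {
  # Keep only x and the current column, mirroring the question's select()
  data <- df2[, c("x", ll)]
  # rlang::sym(ll) builds the symbol a, b, c or d; !! injects it right away,
  # so the captured aesthetic no longer refers to ll
  plotlist[[ll]] <- ggplot(data, aes(x = x, y = !!rlang::sym(ll))) +
    geom_point() +
    ylab(ll)
}

# Building the first plot after the loop forces the aesthetics to be evaluated,
# at a point where ll is already "d"
built <- ggplot_build(plotlist[["a"]])
```

If the injection happened as intended, building the first plot should succeed and use column a, even though ll holds "d" by then.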

Thomas

Write a custom function like this:

plot_fn <- function(df, y) {
  df %>%
    ggplot(aes(x = x, y = get(y))) +
    geom_point() +
    ylab(y)
}

Then iterate over the variables with purrr::map:

map(letters[1:4], ~ plot_fn(df = df2, y = .x))
Nithin .M