Suppose we have the following data:

# simulate data to fit
set.seed(21)
y = rnorm(100)
x = .5*y + rnorm(100, 0, sqrt(.75))

Let's also suppose the user has fit a model:

# user fits a lm
mod = lm(y~x)

Now suppose I have an R package designed to perform several operations on the object mod. For simplicity, suppose it has two functions: one that plots the data and one that computes the coefficients. However, as an intermediate step, suppose we want to perform some operation on the data (in this example, add ten).

Example:

# function that adds ten to all scores
add_ten = function(model) {
  data = model$model
  data = data + 10
  return(data)
}

# functions I defined that do something to the "add_ten" dataset
plot_ten = function(model) {
  new_data = data.frame(add_ten(model))
  x = all.vars(formula(model))[2]
  y = all.vars(formula(model))[1]
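  # namespace every ggplot2 call so the function works without library(ggplot2)
  ggplot2::ggplot(new_data, ggplot2::aes_string(x = x, y = y)) +
    ggplot2::geom_point() +
    ggplot2::geom_smooth()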
}

coefs_ten = function(model) {
  new_data = data.frame(add_ten(model))
  coef(lm(formula(model), new_data))
}

(Obviously, this is pretty silly to do. In actuality, the operation I want to perform is multiple imputation, which is computationally intensive).

Notice in the above example I have to call the add_ten function twice, once for plot_ten and once for coefs_ten. This is inefficient.

So, now to my question: what is the best way to create a reusable object within a function?

I could, of course, create an object to be placed in the user's global environment:

add_ten = function(model) {
  # check for add_ten_data in the global environment
  if (exists("add_ten_data", where = .GlobalEnv)) return(get("add_ten_data", envir = .GlobalEnv))
  data = model$model
  data = data + 10
  # assign add_ten_data to the global environment
  assign('add_ten_data', data, envir = .GlobalEnv)
  return(data)
}

I'm happy to do so, but worry about the "netiquette" of putting something in the user's environment. There's also a potential problem if users happen to have an object called "add_ten_data" in their environment.

So, what is the best way of accomplishing this?

Thanks in advance!

dfife
  • Call `add_ten()` function once, and then pass that result into the `plot_ten` and `coefs_ten` functions. Functions should not create global variables and it's not a good idea for functions to assume that certain global variables exist. – MrFlick Dec 01 '20 at 18:27
  • Agreed. It's generally considered best practice for functions to be self-contained. Inputs are passed in, and results are returned. This prevents strange behavior and makes it easier to update code later. So move `new_data = data.frame(add_ten(model))` outside of your functions. Then run the functions by passing in `new_data` instead of passing in `model`. If you want you can remove `new_data` when you're done with it. – Adam Sampson Dec 01 '20 at 18:30
  • Thanks for the comments. I was hoping to avoid an additional step on the user's end, but it seems that may be unavoidable without violating best practices. – dfife Dec 01 '20 at 19:28
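
The suggestion in the comments above could look roughly like the following. This is a minimal sketch, assuming the downstream functions are rewritten to accept the precomputed data frame; `coefs_from_data` and its `formula` argument are hypothetical names, not part of the original post.

```r
# simulate data and fit the model, as in the question
set.seed(21)
y <- rnorm(100)
x <- .5 * y + rnorm(100, 0, sqrt(.75))
mod <- lm(y ~ x)

# the "expensive" step (here just adding ten), as in the question
add_ten <- function(model) model$model + 10

# hypothetical variant of coefs_ten: takes the precomputed data, not the model
coefs_from_data <- function(new_data, formula) {
  coef(lm(formula, data = new_data))
}

# run the expensive step once, then pass the result to each downstream function
new_data <- add_ten(mod)
coefs_from_data(new_data, formula(mod))
```

The cost is the extra step on the user's side that the last comment mentions: the user has to call `add_ten()` and hold on to its result.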

1 Answer

You should certainly avoid writing an object to the global environment. If you find that you have to repeat the same computationally expensive task at the top of a number of different functions, it means you are carrying out the computationally expensive task too late.

For example, you could create an S3 class that holds the necessary components to produce a "cheap" plot and a "cheap" extraction of the coefficients. It even has the benefits of generic dispatch:

add_ten <- function(model) model$model + 10

lm_tens <- function(formula, data)
{
  model <- if(missing(data)) lm(formula) else lm(formula, data = data)
  
  structure(list(data = data.frame(add_ten(model)), model = model),
            class = "tens")
}

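plot.tens <- function(x, ...) {
  # variable names come from the stored model's formula: response first, then predictor
  vars <- all.vars(formula(x$model))
  # .data[[name]] looks columns up by name; bare x/y in aes() only worked by coincidence
  ggplot2::ggplot(x$data, ggplot2::aes(x = .data[[vars[2]]], y = .data[[vars[1]]])) +
    ggplot2::geom_point() +
    ggplot2::geom_smooth() +
    ggplot2::labs(x = vars[2], y = vars[1])
}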

# argument names match the coef() generic, coef(object, ...)
coef.tens <- function(object, ...) {
  coef(lm(formula(object$model), data = object$data))
}

So now we just need to do:

set.seed(21)
y = rnorm(100)
x = .5*y + rnorm(100, 0, sqrt(.75))

mod <- lm_tens(y ~ x)
coef(mod)
#> (Intercept)           x 
#>   4.3269914   0.5775404
plot(mod)
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Note that we only need to call add_ten once here.

Allan Cameron
  • I've seen functions do something like this (e.g., the mice package and the norm package) and always found the two-stage process a little frustrating. But I think a good alternative is similar to what you propose: do not *require* the lm_tens function, but use it if they've called it (otherwise, repeat add_ten). – dfife Dec 01 '20 at 19:27
  • @dfife it depends what possible uses you have for the object. I came across an example today from the `eulerr` package. The `euler` class contains a few different lightweight fields that make it convenient to bundle them up into a class, but most of the work is done later; the `plot.euler` function is expensive, so the structures needed to draw the plot are only generated when `plot` is called. On the other hand, most regression functions do the computationally expensive part at the outset and you can pass the model around knowing that it's going to be cheap to do any work on it later. – Allan Cameron Dec 01 '20 at 19:34
  • Let me give a bit more detail (just in case I'm overlooking some obvious solution). My package accepts models fitted with the lavaan package. (lavaan is NOT my package). Some of my functions require computing standard errors (which are estimated through multiple imputation, which is computationally intensive). I can't attach these standard errors to the already-estimated lavaan model, so I have been computing them for each function. But the same standard error computations might happen multiple times for different functions. Hence the question about placing them in the global environment :) – dfife Dec 01 '20 at 19:39
  • @dfife so why not create a class that wraps the lavaan object and holds it as a member, but also holds the standard errors? Say your class is called "dfife_class" and contains a member "model" which is a lavaan object, and a member "SE" which has the computed standard errors. At the head of each function check whether it has been passed a lavaan object or a "dfife_class" object. If it's a lavaan object, calculate the SE and turn the lavaan into a "dfife_class" object. Then write your function to handle "dfife_class" objects – Allan Cameron Dec 01 '20 at 20:25
  • I see what you're saying. Yes, that's a good idea and will save an extra step (assuming users fit with lm_tens instead of lm). Thanks! – dfife Dec 01 '20 at 21:18
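
The wrapper idea from the last few comments might be sketched as follows. This is only an illustration under assumptions: it uses `lm` in place of a lavaan model, "add ten" in place of the expensive standard-error computation, and the hypothetical helper name `as_tens` (none of these come from lavaan or from the asker's package).

```r
# stand-in for the expensive computation ("add ten", as in the question)
add_ten <- function(model) model$model + 10

# hypothetical helper: wrap a fitted model together with the expensive result
as_tens <- function(model) {
  # already wrapped: reuse the stored result instead of recomputing
  if (inherits(model, "tens")) return(model)
  structure(list(model = model, data = add_ten(model)), class = "tens")
}

# each downstream function normalises its input first
coefs_ten <- function(model) {
  tens <- as_tens(model)
  coef(lm(formula(tens$model), data = tens$data))
}

set.seed(21)
y <- rnorm(100)
x <- .5 * y + rnorm(100, 0, sqrt(.75))
mod <- lm(y ~ x)

coefs_ten(mod)           # plain lm: add_ten() runs inside this call
wrapped <- as_tens(mod)  # pay the cost once up front
coefs_ten(wrapped)       # wrapped object: the stored data is reused
```

With this layout the wrapping step is optional: users who skip it get the same results at the price of repeated computation, while users who call `as_tens()` first pay for it only once.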