I'm writing a function for an R package that's similar to the one below. The function splits the data into k folds and, for each fold, runs lm() on the remaining rows. A data frame named coefs then stores the coefficients of each fitted lm object, and coefs is what the function returns.

func <- function(formula, data, k){
  folds <- cut(seq(1, nrow(data)), breaks=k, labels=FALSE)
  for(i in 1:k){
    tstIdx <- which(folds==i, arr.ind = TRUE)
    trn <- data[-tstIdx, ]
    assign(paste0('lm', i), lm(as.formula(formula), data = trn))
  }

  coefs <- data.frame(lm1=numeric(length(lm1$coefficients)))
  for(i in 1:k){
    coefs[, paste0('lm', i)] <- get(paste0('lm', i))$coefficients
  }

  return(coefs)
}

#Test func
library(datasets)
data(mtcars)
mtcars_coefs <- func('mpg~.', mtcars, 5)
print(mtcars_coefs)
            lm1          lm2         lm3          lm4           lm5
1  18.930505234 -11.52502902 15.33671764 34.344163557 -1.423949e+01
2  -0.026451367  -0.62542095  0.19530279 -0.983487140  1.019901e+00
3   0.006726114   0.03824514  0.01586916  0.003882512  6.283603e-05
4  -0.026141009  -0.01646497 -0.02470510 -0.010100503 -8.608105e-03
5  -0.430795818   1.04213865 -0.05029561  0.707478977  5.183456e+00
6  -2.811187445  -6.43034312 -5.97395758 -3.264676799 -5.520332e-01
7   0.684446470   2.24100765  0.96305888  0.102627465  5.468843e-01
8   1.033639000  -1.35217769 -0.14155710 -0.247260138 -1.086643e-01
9   4.674891158   2.52237260  0.37723390  0.089823364  2.841110e+00
10  0.201546058   1.10453631  1.24816558 -0.104956417  2.342505e+00
11 -0.257196875   0.49039883  0.17770208 -0.324387269 -2.453959e+00

I create coefs by initializing it from the first lm object (lm1) and then adding one column per model. This works fine, but when I run Check on the package, I get the following NOTE:

* checking R code for possible problems ... [7s] NOTE
func: no visible binding for global variable 'lm1'
Undefined global functions or variables:
  lm1

How can I either edit the code so that lm1 is "visible" or tell the package Check to ignore the problem?

Gaurav Bansal
  • There are other ways to solve this problem instead of using the assign function. Try using a list or a vector to store the result from the `lm` function. Then the results stay local to the function and you are not tampering with the global environment. – Dave2e Jul 03 '18 at 20:14
  • I concur. This is not something a package should do. Use lists, [not get/assign](https://stackoverflow.com/questions/17559390/why-is-using-assign-bad). – MrFlick Jul 03 '18 at 20:25
  • When a package starts modifying *my* environment, I promptly look for alternative packages/functions (and uninstall this one as soon as I find an acceptable one). It's "sloppy" data management, and it makes me start doubting other aspects of the package. This may sound snotty, but [side](https://en.wikipedia.org/wiki/Side_effect_(computer_science))-[effects](https://softwareengineering.stackexchange.com/questions/40297/what-is-a-side-effect) break reproducible research and consistency in silent and difficult-to-troubleshoot ways. – r2evans Jul 03 '18 at 20:31
  • Can you show me how to rework `func` to use lists instead of `assign`? – Gaurav Bansal Jul 03 '18 at 20:34
  • When asking for help, you should include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. Since we can't run your code we don't know for sure what it does or what the output should be. – MrFlick Jul 03 '18 at 20:42
  • I modified the question to include an example. – Gaurav Bansal Jul 03 '18 at 20:48

1 Answer

Here's a way to re-write the function to just use lists.

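func <- function(formula, data, k){
  # Assign each row to one of k contiguous folds
  folds <- cut(seq(1, nrow(data)), breaks=k, labels=FALSE)
  foldlist <- unique(folds)
  # Fit one model per fold, holding that fold out; keep the fits in a list
  models <- lapply(foldlist, function(i) {
    tstIdx <- which(folds==i, arr.ind = TRUE)
    trn <- data[-tstIdx, ]
    lm(as.formula(formula), data = trn)
  })
  names(models) <- paste0("lm", foldlist)

  # One column of coefficients per model, named after the list elements
  as.data.frame(sapply(models, function(m) {
    coef(m)
  }))
}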

mtcars_coefs <- func('mpg~.', mtcars, 5)

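In R, you just sapply/lapply over collections. There's no need to create named variables; just name the elements of the list. Because the models now live in a list rather than in variables created by assign(), the code never refers to lm1 directly, so the check has no undefined variable to flag.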
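As a minimal sketch of the same named-list pattern on a smaller model, the example below swaps sapply() for vapply(), which additionally checks that every model returns the same number of coefficients. The helper name fold_coefs, the formula mpg ~ wt + hp, and k = 4 are assumptions chosen for illustration and are not part of the original answer.

```r
# Hypothetical helper (not from the original answer): fit one model per
# held-out fold and collect the coefficients without assign()/get().
fold_coefs <- function(formula, data, k) {
  folds <- cut(seq_len(nrow(data)), breaks = k, labels = FALSE)
  fold_ids <- sort(unique(folds))

  # Each list element holds one fitted model; no variables are created by name.
  models <- lapply(fold_ids, function(i) {
    lm(as.formula(formula), data = data[folds != i, , drop = FALSE])
  })
  names(models) <- paste0("lm", fold_ids)

  # vapply() errors if a model returns a different number of coefficients.
  n_coef <- length(coef(models[[1]]))
  as.data.frame(vapply(models, coef, numeric(n_coef)))
}

coefs <- fold_coefs("mpg ~ wt + hp", mtcars, 4)
names(coefs)    # column names come from the list names
coefs[["lm2"]]  # one fold's coefficients, accessed by name
```

Individual models or columns are then reached by name or index (for example coefs[["lm2"]]) rather than through get().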

MrFlick