2

I am attempting to build a general framework for quickly evaluating a variety of models. I am trying to use a factory pattern to generate "model trainer" functions that take a data frame and return a trained model. However, I am running into unexpected behavior of R's built-in lm function within this framework.

gen_lm_model_trainer <- function(formula, weights_col = NULL) {
  function(train_data) {
    trained_lm <- lm(formula = formula,
                     data = train_data,
                     weights = train_data[[weights_col]])

    pred_func <- function(test_data) {
      prediction <- predict(trained_lm, newdata = test_data)
      return(prediction)
    }

    return(list(predict = pred_func, info = trained_lm))
  }
}

mtcars$random_weights <- rbeta(nrow(mtcars), shape1 = 5, shape2 = 2)
trainer <- gen_lm_model_trainer(formula = mpg ~ ., weights_col = 'random_weights')
trained_model <- trainer(mtcars)

The response to this code is the following:

Error in eval(extras, data, env) : object 'train_data' not found

This is similar another SO question, Object not found error when passing model formula to another function, but this problem is not solved by assigning the formula's environment to the generated function's environment, i.e.

gen_lm_model_trainer <- function(formula, weights_col = NULL) {
  function(train_data) {
    scoped_formula <- as.formula(formula, env = environment())
    trained_lm <- lm(formula = scoped_formula,
                     data = train_data,
                     weights = train_data[[weights_col]])

    pred_func <- function(test_data) {
      prediction <- predict(trained_lm, newdata = test_data)
      return(prediction)
    }

    return(list(predict = pred_func, info = trained_lm))
  }
}

A solution that works consistently for both problems would be most appreciated.

John Haberstroh
  • 465
  • 4
  • 11

3 Answers3

1

I have found a partial answer to the question -- partial in that it solves only this case and not the linked SO question. The problem seems to be that lm's arguments are being evaluated in the environment that corresponds to calling with(train_data, lm(...)). It should therefore be safe to use parent.frame() to traverse to the environment of the calling function (the "model trainer"). This happens to correspond to a depth of n = 1 -- in this case, I think that n = 1 is the data frame's environment, n = 2 is eval's environment, and n = 3 is the environment from which lm is being called.

gen_lm_model_trainer <- function(formula, weights_col = NULL) {
  function(train_data) {
    trained_lm <- lm(formula = formula,
                     data = train_data,
                     weights = get('train_data', parent.frame(3))[[get('weights_col', parent.frame(3))]])

    pred_func <- function(test_data) {
      prediction <- predict(trained_lm, newdata = test_data)
      return(prediction)
    }

    return(list(predict = pred_func, info = trained_lm))
  }
}

mtcars$random_weights <- rbeta(nrow(mtcars), shape1 = 5, shape2 = 2)
trainer <- gen_lm_model_trainer(formula = mpg ~ ., weights_col = 'random_weights')
trained_model <- trainer(mtcars)

Why lm changes scope so unusually is unclear to me and seems like a bug.

John Haberstroh
  • 465
  • 4
  • 11
  • 2
    I think if you look carefully, you'll discover that the method you've used is actually not successfully changing the environment of the formula. It works for me fine if I do `environment(formula) <- environment()` instead and just pass `formula` to `lm`. The method you used may only work when the formula begins as a character. – joran Dec 21 '18 at 22:27
  • Do you want to add this as an answer? I checked it and it seems like the best solution since it solves both problems. – John Haberstroh Dec 21 '18 at 22:47
0

For interesting reasons adding

random_weights <- train_data[[weights_col]]

or, more generically

assign(weights_col, train_data[[weights_col]])

to the beginning of your function(train_data) { and passing the random_weights as weights to lm will fix this, resulting function looks like this:

gen_lm_model_trainer <- function(formula, weights_col = NULL) {
  function(train_data) {
    assign(weights_col, train_data[[weights_col]])
    trained_lm <- lm(formula = formula, data = train_data, weights = random_weights)

    pred_func <- function(test_data) {
      prediction <- predict(trained_lm, newdata = test_data)
      return(prediction)
    }

    return(list(predict = pred_func, info = trained_lm))
  }
}

mtcars$random_weights <- rbeta(nrow(mtcars), shape1 = 5, shape2 = 2)
trainer <- local(gen_lm_model_trainer(formula = mpg ~ ., weights_col = 'random_weights'))
trained_model <- trainer(mtcars)

Reasoning:

The underlying reason is that the weights are passed to stats::model.frame.default as part of ... and that gets evaluated separately:

  env <- environment(formula)
  # ...
  # more code
  # ...
  extras <- substitute(list(...))
  extranames <- names(extras[-1L])
  extras <- eval(extras, data, env)
Jozef
  • 2,617
  • 14
  • 19
  • 2
    I think it is also sufficient to, instead of the `scoped_formula` attempt, replace that line with `environment(formula) <- environment()`, and just pass `formula` to `lm`. – joran Dec 21 '18 at 22:28
  • @joran, indeed, I also think this is the most elegant solution, I would suggest making it an answer. – Jozef Dec 21 '18 at 22:53
0

The following worked for me:

gen_lm_model_trainer <- function(formula, weights_col = NULL) {
  function(train_data, .fml = formula, .wts = weights_col) {
    w <- train_data[[.wts]]
    environment(.fml) <- environment()
    trained_lm <- lm(formula = .fml,
                     data = train_data,
                     weights = w)
    pred_func <- function(test_data) {
      predict(trained_lm, newdata = test_data)
    }
  list(predict = pred_func, info = trained_lm)
  }
}

mtcars$random_weights <- rbeta(nrow(mtcars), shape1 = 5, shape2 = 2)
trainer <- gen_lm_model_trainer(formula = mpg ~ ., weights_col = 'random_weights')
trained_model <- trainer(mtcars)

I may have made some cosmetic changes but there are just two real changes:

1) environment(.fml) <- environment() # to make sure that the object within the function's scope are accessible # otherwise it won't find the weights thing but curiously, it can find the data

2) Passing the formula and weights column names as arguments.

I can't quite explain why this combination works ... it's an interesting case. I've generated lm models with a different approach and there's always trouble with it.

lebatsnok
  • 6,329
  • 2
  • 21
  • 22
  • If interested, I tried to explain why it can find the data but not the weights below - https://stackoverflow.com/a/53891420/8876652 - weights are passed to `stats::model.frame.default` as part of `...` that gets different treatment. – Jozef Dec 21 '18 at 22:57