How does parsnip know how to match `fit` arguments to function arguments for a model?

Question

I am trying to create a new model for the parsnip package from an existing modeling function foo.

I have followed the tutorial on building new models in parsnip and the README on GitHub, but I still cannot figure out some things.

How does the fit function in parsnip know how to assign its input data (e.g. a matrix) to my idiosyncratic function call?

Imagine if there was an idiosyncratic model function foo where the conventional roles of x and y arguments were reversed: i.e. foo(x,y) where x should be an outcome vector and y should be a predictor matrix, bizarrely.

For example, suppose a is a matrix of predictors and b is a vector of outcomes. Then I call fit_xy(object = my_model, x = a, y = b). Internally, how does fit_xy() know to call foo(x = y, y = x)?

score 3 · Answer 1 · answered Jul 13 '21 at 22:31

The function that validates the input is check_final_param, which requires every argument to be named. That is why the order is not important. https://github.com/tidymodels/parsnip/blob/f7ba069671684f61af0ca1eadb1927fedec8a9c6/R/misc.R#L235

The README you linked points out: "To create the model fit call, the protect arguments are populated with the appropriate objects (usually from the data set), and rlang::call2 is used to create a call that can be executed."

For example, randomForest uses ntree instead of parsnip's default trees argument. parsnip defines a translation for such arguments, which is applied when the call is evaluated. https://github.com/tidymodels/parsnip/blob/228a6dc6975fc91562b63d191e43d2164cc78e3d/R/rand_forest_data.R#L339

If we use call2 and splice in the named args, their order does not matter, and we know the args will be properly named because of the additional translation step.

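# Named arguments collected in a list
args <- list(na.rm = TRUE, trim = 0)

# Splice the list into the call; this builds mean(1:10, na.rm = TRUE, trim = 0)
rlang::call2("mean", 1:10, !!!args)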
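
To make the "order does not matter" point concrete, here is a minimal sketch in base R. It uses as.call as a dependency-free stand-in for rlang::call2, and the argument values are made up for illustration. It builds the same call with the named arguments in two different orders and checks that both evaluate to the same result.

```r
# The same named arguments, in two different orders (illustrative values).
args_a <- list(x = c(1, 2, 30), trim = 0, na.rm = TRUE)
args_b <- rev(args_a)

# Build unevaluated calls from a function name plus a named argument list.
call_a <- as.call(c(list(as.name("mean")), args_a))
call_b <- as.call(c(list(as.name("mean")), args_b))

# R matches these arguments by name, so the order in the list is irrelevant.
result_a <- eval(call_a)
result_b <- eval(call_b)
identical(result_a, result_b)
```

This only covers arguments that already carry the engine's names. It says nothing about how x and y from fit_xy() get their engine-side names.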

But this does not explain how the *main* arguments are matched. "To create the model fit call, the protect arguments are populated with the appropriate objects (usually from the data set)", but nowhere is it specified how to match arguments `x` and `y` in `fit_xy()` to some idiosyncratic function `foo2(p,q)`. How does it know to assign `x` to `p` and `y` to `q` instead of the other way around? I tell it how to assign *auxiliary* arguments to `foo2` via `set_args`, but `set_args` does not cover the "main" arguments (predictor matrix and outcome vector). — cmo, Jul 14 '21 at 08:08

score 1 · Accepted Answer · answered Jul 30 '21 at 23:56

We do this through the set_fit() function. Most models are sensible enough that we can use the default mappings (for example, from a data argument to a data argument, or from x to x), but you are right that some models follow different conventions. One example is the Spark models, which use x to mean what we would normally call data in a formula method.

The random forest set_fit() function for Spark looks like this:

set_fit(
  model = "rand_forest",
  eng = "spark",
  mode = "classification",
  value = list(
    interface = "formula",
    data = c(formula = "formula", data = "x"),
    protect = c("x", "formula", "type"),
    func = c(pkg = "sparklyr", fun = "ml_random_forest"),
    defaults = list(seed = expr(sample.int(10 ^ 5, 1)))
  )
)

Notice especially the data element of the value argument. You can read a bit more here.
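
Applied to the reversed foo(x, y) from the question, the idea behind the data element can be sketched in base R as follows. This is a simplified illustration, not parsnip's actual implementation: foo, data_map and build_fit_call are hypothetical names. It also assumes that, for a matrix interface, the names of the mapping are the fit_xy() side and the values are the engine's argument names, so the registration would presumably use data = c(x = "y", y = "x"). That assumption is worth checking against the parsnip source.

```r
# Hypothetical engine function with reversed roles:
# `x` is the outcome vector, `y` is the predictor matrix.
foo <- function(x, y) {
  list(outcome = x, predictors = y)
}

# Mapping in the style of the `data` element (assumed direction):
# names are the fit_xy() side, values are the engine's argument names.
data_map <- c(x = "y", y = "x")

# Simplified stand-in for the call-building step (not parsnip's code).
build_fit_call <- function(fun, data_map) {
  args <- list()
  args[[data_map[["x"]]]] <- quote(x)  # predictors go to the engine's `y`
  args[[data_map[["y"]]]] <- quote(y)  # outcome goes to the engine's `x`
  as.call(c(list(as.name(fun)), args))
}

fit_call <- build_fit_call("foo", data_map)

# Evaluate the call where `x` and `y` hold the user's data.
a <- matrix(1:6, nrow = 3)  # predictors
b <- c(10, 20, 30)          # outcome
res <- eval(fit_call, list(x = a, y = b))
```

Because the call is built with named arguments, the predictor matrix ends up in the engine's y and the outcome vector in the engine's x, whatever order the arguments appear in.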

thank you @JuliaSilge, that `data` field is what i was looking for. However, the `data` field of `value` is not documented in the linked page, nor on the vignette for building a parsnip model from scratch. — cmo, Oct 05 '21 at 17:21
