I want to train a model via tidymodels using predictions from another model as a feature. Specifically, it's a KNN model where I want to use predictions from a random forest model as a feature.

I started implementing a (hacky) solution using step_mutate(). Here it is:

library(dplyr)
library(tidymodels)
library(purrr)
library(data.table)

df <- data.table(
  y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100)
)

pred_rf <- function(...) {
  # Very hacky function that creates random forest predictions
  nms <- purrr::map_chr(rlang::enexprs(...), as.character)
  l <- list(...)
  dat <- setDT(l)

  outcome <- names(dat)[1]
  preds <- names(dat)[-1]

  rec <- recipe(dat) %>%
    update_role(!!outcome, new_role = "outcome") %>%
    update_role(!!preds, new_role = "predictor")

  model <- rand_forest(mode = "regression")
  wf <- workflow() %>%
    add_recipe(rec) %>%
    add_model(model)

  fitted_model <- fit(wf, dat)
  predictions <- predict(fitted_model, dat)$.pred
  stopifnot(length(predictions) == nrow(dat))
  stopifnot(sum(is.na(predictions)) == 0)

  return(predictions)
}

rec <- recipe(y ~ ., df) %>%
  step_mutate(y_pred = pred_rf(y, x1, x2)) %>% 
  prep()

bake(rec, new_data = NULL) # Desired output would be a design matrix like this

However, I realised that this would cause data leakage when used for tuning. Is it possible to do this without data leakage, or would I need to create a custom step? It would be very similar to the step_impute_* functions, but I couldn't find anything.

Thanks

bartleby
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Feb 09 '23 at 14:33

1 Answer

The comment about data leakage is spot on. That is a huge concern (especially for a random forest model). This isn't an issue for imputation, since the original model and the imputation model have different outcome variables.

We suggest re-framing the problem by including the original and KNN models (and others) in a stacking ensemble. That way, your other models can affect the outcome but are not inside another model. That may not be what you want, but I don't see any way to get there without significant overfitting.
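
For illustration, a stacking ensemble along these lines could be sketched with the stacks package. This is only a sketch under assumptions: it assumes the stacks, ranger, and kknn packages are installed, and it uses synthetic data with some signal in it (unlike the pure-noise data in the question) so that the blending step has something to learn. The object names (rf_res, knn_res, ens) are made up for this example.

```r
library(tidymodels)
library(stacks)

set.seed(1)
# Assumption: synthetic data with real signal, so the ensemble keeps members
df <- tibble(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
df$y <- 2 * df$x1 + df$x2^2 + rnorm(200, sd = 0.5)

folds <- vfold_cv(df, v = 5)
rec <- recipe(y ~ ., data = df)

# stacks needs the out-of-fold predictions and workflows to be saved
ctrl <- control_stack_resamples()

rf_res <- workflow() %>%
  add_recipe(rec) %>%
  add_model(rand_forest(mode = "regression")) %>%
  fit_resamples(resamples = folds, control = ctrl)

knn_res <- workflow() %>%
  add_recipe(rec) %>%
  add_model(nearest_neighbor(mode = "regression")) %>%
  fit_resamples(resamples = folds, control = ctrl)

# Both models sit side by side; a penalised linear model blends their
# out-of-fold predictions, so neither model is nested inside the other
ens <- stacks() %>%
  add_candidates(rf_res) %>%
  add_candidates(knn_res) %>%
  blend_predictions() %>%
  fit_members()

preds <- predict(ens, df)
```

Because the blending coefficients are estimated from out-of-fold predictions, the random forest never scores rows it was trained on when its contribution is weighted.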

As a side note, step_mutate() wouldn't work since the model doesn't persist. You would have to emulate the imputation steps to make sure that new samples can be processed with the recipe. The PLS and class distance steps are also good examples to emulate.
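
To make the point about persistence concrete, here is a rough sketch of what such a custom step might look like. Everything named step_rf_pred is hypothetical and not part of recipes, the output column name y_pred is hard-coded for brevity, and the ranger package is assumed to be installed. The forest is fitted once in prep() and stored in the step, so bake() can score new samples.

```r
library(tidymodels)

# Hypothetical step: fit a ranger forest at prep() time and keep it
step_rf_pred <- function(recipe, ..., outcome, role = "predictor",
                         trained = FALSE, fit = NULL, columns = NULL,
                         skip = FALSE, id = rand_id("rf_pred")) {
  add_step(
    recipe,
    step(
      subclass = "rf_pred",
      terms = rlang::enquos(...),
      outcome = outcome,
      role = role,
      trained = trained,
      fit = fit,
      columns = columns,
      skip = skip,
      id = id
    )
  )
}

# prep(): resolve the selected predictors and train the forest
prep.step_rf_pred <- function(x, training, info = NULL, ...) {
  cols <- unname(recipes_eval_select(x$terms, training, info))
  x$fit <- ranger::ranger(
    x = as.data.frame(training[, cols, drop = FALSE]),
    y = training[[x$outcome]]
  )
  x$columns <- cols
  x$trained <- TRUE
  x
}

# bake(): reuse the stored forest to add a prediction column
bake.step_rf_pred <- function(object, new_data, ...) {
  newx <- as.data.frame(new_data[, object$columns, drop = FALSE])
  new_data$y_pred <- predict(object$fit, data = newx)$predictions
  new_data
}

set.seed(1)
df <- tibble(
  y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100)
)

rec <- recipe(y ~ ., data = df) %>%
  step_rf_pred(x1, x2, outcome = "y") %>%
  prep()

# New samples without the outcome can now be processed
baked <- bake(rec, new_data = df[1:5, c("x1", "x2", "x3")])
```

Note that this only solves the persistence problem. The y_pred values for the training set are still in-sample predictions from the forest, so the overfitting concern described above remains.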

topepo
  • Thanks! In my situation I actually have to use a KNN model, so stacking wouldn't be an option. I guess what I could do as a last resort is to write a custom step_ function, right? – bartleby Feb 09 '23 at 16:44