To build a stacking model, I trained many base models using different pretreatments of the same dataset. To keep track of how each design matrix is built, I used the recipes package and defined my own steps. However, passing a recipe with a custom step to caret's train() turned out to be about 20x slower than applying the same pretreatment by hand and training the caret model on the resulting design matrix. Any idea why, and how to improve this?
I provide a reproducible example below:
# Loading libraries
packs <- c("tidyverse", "caret", "e1071", "wavelets", "recipes")
# Install a package if it is missing, then attach it
InstIfNec <- function(pack) {
  if (!require(pack, character.only = TRUE)) {
    install.packages(pack)
  }
  require(pack, character.only = TRUE)
}
lapply(packs, InstIfNec)
# Getting data
data(biomass)
biomass <- select(biomass,-dataset,-sample)
# Defining custom pretreatment algorithm
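HaarTransform <- function(DF1) {
  # Haar DWT of one row, keeping the first-level V coefficients
  w <- function(k) {
    s1 <- dwt(k, filter = "haar")
    return(s1@V[[1]])
  }
  Smt <- as.matrix(DF1)
  # Transform each row, then transpose back to one row per sample
  Smt <- t(base::apply(Smt, 1, w))
  return(data.frame(Smt))
}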
# Creating the custom step function
# Constructor for the step object
step_Haar_new <- function(terms, role, trained, skip, columns, id) {
  step(subclass = "Haar", terms = terms, role = role,
       trained = trained, skip = skip, columns = columns, id = id)
}
# User-facing function that adds the step to a recipe
step_Haar <- function(recipe, ..., role = "predictor", trained = FALSE,
                      skip = FALSE, columns = NULL, id = rand_id("Haar")) {
  terms <- ellipse_check(...)
  add_step(recipe, step_Haar_new(terms = terms, role = role, trained = trained,
                                 skip = skip, columns = columns, id = id))
}
# prep() method: resolve the selected columns; nothing else is estimated
prep.step_Haar <- function(x, training, info = NULL, ...) {
  col_names <- terms_select(terms = x$terms, info = info)
  step_Haar_new(terms = x$terms, role = x$role, trained = TRUE,
                skip = x$skip, columns = col_names, id = x$id)
}
# bake() method: replace the selected columns with their Haar transform
bake.step_Haar <- function(object, new_data, ...) {
  predictors <- HaarTransform(dplyr::select(new_data, object$columns))
  new_data[, object$columns] <- NULL
  bind_cols(new_data, predictors)
}
# Fitting the caret model using the recipe
system.time({
  Haar_recipe <- recipe(carbon ~ ., biomass) %>%
    step_Haar(all_predictors())
  set.seed(1)
  fit <- caret::train(Haar_recipe, data = biomass, method = "svmLinear")
})
# Fitting the caret model with the handmade pretreatment
system.time({
  df <- HaarTransform(biomass[, -1])
  set.seed(1)
  fit2 <- caret::train(x = df, y = biomass[, 1], method = "svmLinear")
})
# Comparing results
fit; fit2
# Both approaches give the same result, but the recipe approach takes ~20 seconds while the handmade pretreatment takes ~1.5 seconds
Profiling with profvis suggests that the recipe approach repeats the same work many times (27 times), each time through separate try() and eval() calls.
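One way to double-check that count without reading the profile is to wrap the pretreatment in a call counter. The sketch below is only an illustration: `count_calls` is a hypothetical helper written for this post (it is not part of recipes or caret), and it is shown on a toy function so that it runs in base R on its own.

```r
# Hypothetical helper: wrap a function so that every call increments a counter.
count_calls <- function(f) {
  n <- 0L
  wrapped <- function(...) {
    n <<- n + 1L   # update the counter kept in the enclosing environment
    f(...)
  }
  # Accessor that reports how many times the wrapper has been called so far
  attr(wrapped, "calls") <- function() n
  wrapped
}

# Toy usage: the wrapped function is called once per element
square <- count_calls(function(x) x^2)
res <- lapply(1:3, square)
attr(square, "calls")()  # 3
```

Assuming the definitions from the example above are in the session, running `HaarTransform <- count_calls(HaarTransform)` before each `train()` call and then reading `attr(HaarTransform, "calls")()` should show how often each approach runs the pretreatment.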