
I was looking at reducing the size of my trained models (namely in this and this post) and I have come across the trim parameter in caret's train function. Specifically, it was added in version 6.0-47. From the documentation:

If TRUE the final model in object$finalModel may have some components of the object removed to reduce the size of the saved object. The predict method will still work, but some other features of the model may not work. Trimming will occur only for models where this feature has been implemented.

I realize the results of using trim may vary by method. Is there a resource to determine what will be kept and what will be removed from the final model when trim is used? How much space can I expect to save? What (if any) functionality is lost?

From previous questions it is ambiguous whether the parameter saves any space at all. For example, here is a simple example where trim=TRUE and trim=FALSE return an object of the same size using randomForest:

library(caret)
library(pryr)

# make a large dataset so iris example is not too trivial
large_iris <- iris[rep(seq_len(nrow(iris)), 10), ]
object_size(large_iris) # 1.38 MB

set.seed(1234)
mdl1 <- train(Species ~ ., data = large_iris, method = "rf", trControl = trainControl(trim = FALSE))
object_size(mdl1) # 1.24 MB
attributes(mdl1)

set.seed(1234)
mdl2 <- train(Species ~ ., data = large_iris, method = "rf", trControl = trainControl(trim = TRUE))
object_size(mdl2) # 1.24 MB
attributes(mdl2)
cacti5

2 Answers


The trim option (for now) does not work with randomForest.

If you search GitHub for caret issues mentioning trim, you will find this list of issues.

Issue 90 mentions this about trim:

This has been implemented for models bayesglm, C5.0, C5.0Cost, C5.0Rules, C5.0Tree, glm, glmnet, rpart, rpart2, and treebag. The current regression tests are passing.

This confirms my investigation of the code: in caret's testthat suite you can see that these are the models tested to verify that trim produces a smaller object and that the predict output is still correct.
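If you want to check a method yourself, a minimal sketch, assuming (as the caret sources suggest) that trim support shows up as a trim element in the model info returned by getModelInfo:

library(caret)

# Models that implement trimming carry a trim function in their model info;
# rf does not, rpart does.
is.null(getModelInfo("rf", regex = FALSE)[[1]]$trim)     # TRUE  -> no trim support
is.null(getModelInfo("rpart", regex = FALSE)[[1]]$trim)  # FALSE -> trim supported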

Also, using the non-formula interface might reduce the footprint a bit, since caret does some extra work when you use a formula interface (see the sketch below).
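For example, the non-formula (x/y) call for the large_iris data from the question would look roughly like this:

library(caret)

set.seed(1234)
# x/y interface: pass predictors and outcome separately so caret can skip
# the model.frame/terms bookkeeping it does for the formula interface.
mdl3 <- train(x = large_iris[, names(large_iris) != "Species"],
              y = large_iris$Species,
              method = "rf",
              trControl = trainControl(trim = TRUE))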

phiver

I did a little digging into the caret package and the methods listed by @phiver. tl;dr: see below for what trim does for bayesglm, C5.0, C5.0Cost, C5.0Rules, C5.0Tree, glm, glmnet, rpart, rpart2, and treebag; setting trim=TRUE for any other method has no effect.

bayesglm and glm

Components unnecessary for prediction are trimmed from the model, so that the size of the fitted model is roughly constant with respect to the size of the training data. The trimming function is

function(x) {
    x$y = c()
    x$model = c()

    x$residuals = c()
    x$fitted.values = c()
    x$effects = c()
    x$qr$qr = c()
    x$linear.predictors = c()
    x$weights = c()
    x$prior.weights = c()
    x$data = c()

    x$family$variance = c()
    x$family$dev.resids = c()
    x$family$aic = c()
    x$family$validmu = c()
    x$family$simulate = c()
    attr(x$terms,".Environment") = c()
    attr(x$formula,".Environment") = c()
    x$R <- c() #Not in a glm
    x$xNames <- c()
    x$xlevels <- c()
    x
}

This is based primarily on this post, which contains an interesting analysis, along with this limitation:

One point and one caveat. You can null out model$family entirely; the predict function will still return its default value, the link value (that is, predict(model, newdata=data) will work). However, predict(model, newdata=data, type='response') will fail. You can still recover the response by passing the link value through the inverse link function: in the case of logistic regression, this is the sigmoid function, sigmoid(x) = 1/(1 + exp(-x)).

The caveat: many of the other things besides predict that you might like to do with a glm model will fail on the stripped-down version: in particular summary(), anova() and step(). So any characterization that you want to do on a candidate model should be done before trimming down the fat. Once you have decided on a satisfactory model, you can strip it down and save it for use in future predictions.
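To make that recovery step concrete, here is a small sketch for a logistic regression on iris (plogis is base R's logistic function, i.e. the sigmoid):

# Fit a logistic regression, then null out the family component.
dat <- transform(iris, virginica = as.numeric(Species == "virginica"))
fit <- glm(virginica ~ Sepal.Length, data = dat, family = binomial)
fit$family <- NULL

# The default prediction (the link value) still works...
link <- predict(fit, newdata = dat)
# ...but predict(fit, newdata = dat, type = "response") would now fail.

# Recover the response by applying the inverse link by hand.
resp <- plogis(link)  # sigmoid: 1 / (1 + exp(-link))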

For reference, here are the trim functions for the remaining methods with trimming capability, taken directly from the caret package:

glmnet

function(x) {
    x$call <- NULL
    x$df <- NULL
    x$dev.ratio <- NULL
    x
}

C5.0, C5.0Cost, C5.0Rules, C5.0Tree

function(x) {
    x$boostResults <- NULL
    x$size <- NULL
    x$call <- NULL
    x$output <- NULL
    x
}

rpart, rpart2

function(x) {
    x$call <- list(na.action = (x$call)$na.action)
    x$x <- NULL
    x$y <- NULL
    x$where <- NULL
    x
}

treebag

function(x) {
    trim_rpart <- function(x) {
      x$call <- list(na.action = (x$call)$na.action)
      x$x <- NULL
      x$y <- NULL
      x$where <- NULL
      x
    }
    x$mtrees <- lapply(x$mtrees, 
                       function(x){
                         x$bindx <- NULL
                         x$btree <- trim_rpart(x$btree)
                         x
                       } )
    x
}
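To see the trimming pay off, here is the rf example from the question rerun with one of the supported methods (rpart). Exact sizes will vary by machine and package version, but the trimmed object should be measurably smaller:

library(caret)
library(pryr)

large_iris <- iris[rep(seq_len(nrow(iris)), 10), ]

set.seed(1234)
mdl_full <- train(Species ~ ., data = large_iris, method = "rpart",
                  trControl = trainControl(trim = FALSE))

set.seed(1234)
mdl_trim <- train(Species ~ ., data = large_iris, method = "rpart",
                  trControl = trainControl(trim = TRUE))

object_size(mdl_full)  # larger: finalModel still carries call, x, y, where
object_size(mdl_trim)  # smaller: those components are nulled by the trim function

# Prediction still works on the trimmed model.
head(predict(mdl_trim, newdata = large_iris))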
cacti5