I'm trying to use the R package mlr to train a glmnet model on a binary classification problem with a large dataset (about 850,000 rows and about 100 features) on very modest hardware (my laptop with 4GB RAM --- I don't have access to more CPU muscle). I decided to use mlr because I need nested cross-validation to tune the hyperparameters of my classifier and evaluate the expected performance of the final model. To the best of my knowledge, neither caret nor h2o offers nested cross-validation at present, but mlr provides the infrastructure to do this.
However, I find the huge number of functions provided by mlr extremely overwhelming, and it's difficult to know how to slot everything together to achieve my goal. What goes where? How do they fit together? I've read through the entire documentation here: https://mlr-org.github.io/mlr-tutorial/release/html/ and I'm still confused. There are code snippets that show how to do specific things, but it's unclear (to me) how to stitch them together. What's the big picture? I looked for a complete worked example to use as a template and only found this: https://www.bioconductor.org/help/course-materials/2015/CSAMA2015/lab/classification.html which I have been using as my starting point. Can anyone help fill in the gaps?
Here's what I want to do:
- Tune the hyperparameters (the l1 and l2 regularisation parameters) of a glmnet model using grid search or random grid search (or anything faster if it exists -- iterated F-racing? adaptive resampling?), with a stratified k-fold cross-validation inner loop and an outer cross-validation loop to assess the expected final performance.
- Include a feature preprocessing step in the inner loop with centering, scaling, and a Yeo-Johnson transformation, plus fast filter-based feature selection (the latter is a necessity because I have very modest hardware and need to slim the feature space to reduce training time).
- I have imbalanced classes (the positive class is about 20%), so I have opted to use AUC as my optimisation objective. This is only a surrogate for the real metric of interest, which is the false positive rate at a small number of fixed true positive rates (i.e., I want to know the FPR for TPR = 0.6, 0.7, 0.8).
- I'd like to tune the probability thresholds to achieve those TPRs. I gather this is possible in nested CV, but it's not clear exactly what is being optimised here: https://github.com/mlr-org/mlr/issues/856 I'd like to know where the cut should be made without incurring information leakage, so I want to pick it using CV (see my rough sketch just below).
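For the FPR-at-fixed-TPR part, my rough idea (and I'm not at all sure this is the right way to do it in mlr) is to read it off a threshold-vs-performance curve built from the pooled cross-validated predictions. Here `pred` is a placeholder for a prediction object with probabilities (e.g. the `r2$pred` slot of the resample result in the code further down), and the grid size is an arbitrary choice on my part:

# Sweep the probability threshold and record FPR/TPR at each value:
perf <- generateThreshVsPerfData(pred, measures = list(fpr, tpr), gridsize = 200)

# Read off the threshold and FPR at the TPRs I care about (0.6, 0.7, 0.8):
sapply(c(0.6, 0.7, 0.8), function(target) {
  row <- perf$data[which.min(abs(perf$data$tpr - target)), ]
  unlist(row[c("threshold", "fpr", "tpr")])
})

Is that a sensible way to pick the cut-offs without leaking information, given the predictions come from the outer CV loop?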
I'm using glmnet because I'd rather spend my CPU cycles on building a robust model than on a fancy model that produces over-optimistic results. A GBM or random forest can come later if I find it can be trained fast enough, but I don't expect the features in my data to be informative enough to justify investing much time in training anything particularly complex.
Finally, after I've obtained an estimate of the performance I can expect, I want to actually build the final model and obtain its glmnet coefficients --- including which ones are zero, so I know which features have been selected by the LASSO penalty.
Hope all this makes sense!
Here's what I've got so far:
library(mlr)

# DT is my data.table of ~850,000 rows and ~100 features:
df <- as.data.frame(DT)

task <- makeClassifTask(id = "glmnet",
                        data = df,
                        target = "Flavour",
                        positive = "quark")
task
lrn <- makeLearner("classif.glmnet", predict.type = "prob")
lrn
# Feature preprocessing -- want to do this as part of CV:
lrn <- makePreprocWrapperCaret(lrn,
                               ppc.center = TRUE,
                               ppc.scale = TRUE,
                               ppc.YeoJohnson = TRUE)
lrn
# I want to use the implementation of info gain in CORElearn, not Weka:
infGain = makeFilter(
  name = "InfGain",
  desc = "Information gain",
  pkg = "CORElearn",
  supported.tasks = c("classif", "regr"),
  supported.features = c("numerics", "factors"),
  fun = function(task, nselect, ...) {
    CORElearn::attrEval(
      getTaskFormula(task),
      data = getTaskData(task), estimator = "InfGain", ...)
  }
)
infGain
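As a quick sanity check that the custom filter is registered and behaves (this is just my own test, not something from the tutorial):

# Should return a task containing only the 20 highest-scoring features:
filteredTask <- filterFeatures(task, method = "InfGain", abs = 20)
filteredTask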
# Take top 20 features:
lrn <- makeFilterWrapper(lrn, fw.method = "InfGain", fw.abs = 20)
lrn
# Now things start to get foggy...
# Inner CV loop: tunes s (the lambda used for prediction) and alpha (the
# elastic-net mixing parameter):
tuningLrn <- makeTuneWrapper(
  lrn,
  resampling = makeResampleDesc("CV", iters = 2, stratify = TRUE),
  par.set = makeParamSet(
    makeNumericParam("s", lower = 0.001, upper = 0.1),
    makeNumericParam("alpha", lower = 0.0, upper = 1.0)
  ),
  control = makeTuneControlGrid(resolution = 2)
)
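I assume I also need a separate resampling description for the outer loop; something like this (the 5-fold choice is just a guess at what my hardware can cope with):

# Outer CV loop for estimating the performance of the whole tuned pipeline --
# stratified, like the inner loop:
rdesc <- makeResampleDesc("CV", iters = 5, stratify = TRUE)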
r2 <- resample(learner = tuningLrn,
               task = task,
               resampling = rdesc,
               measures = auc)
# Now what...?
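And here's my guess at the remaining steps, based on my reading of the docs -- the getTuneResult() and more.unwrap bits in particular are assumptions on my part, so please correct me if they're wrong:

# Expected performance of the final model, from the outer loop:
r2$aggr

# Refit the whole tuned pipeline on all of the data to get the final model:
final <- train(tuningLrn, task)

# Tuned hyperparameters (s and alpha) chosen by the tune wrapper:
tuneRes <- getTuneResult(final)
tuneRes$x

# Strip off the wrappers to get at the underlying glmnet fit:
glmnetFit <- getLearnerModel(final, more.unwrap = TRUE)

# Coefficients at the tuned s -- the zeros are the features dropped by the
# LASSO penalty:
coef(glmnetFit, s = tuneRes$x$s)

Is that roughly right, or am I missing something fundamental about how the pieces fit together?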