
I'm trying to use the R package mlr to train a glmnet model on a binary classification problem with a large dataset (about 850,000 rows and about 100 features) on very modest hardware (my laptop with 4 GB RAM; I don't have access to more CPU muscle). I decided to use mlr because I need nested cross-validation to tune the hyperparameters of my classifier and to estimate the expected performance of the final model. To the best of my knowledge, neither caret nor h2o offers nested cross-validation at present, but mlr provides the infrastructure to do this.

However, I find the huge number of functions provided by mlr overwhelming, and it's difficult to know how to slot everything together to achieve my goal. What goes where, and how does it all fit together? I've read through the entire tutorial here: https://mlr-org.github.io/mlr-tutorial/release/html/ and I'm still confused. There are code snippets that show how to do specific things, but it's unclear (to me) how to stitch them together. What's the big picture? I looked for a complete worked example to use as a template and only found this: https://www.bioconductor.org/help/course-materials/2015/CSAMA2015/lab/classification.html which I have been using as my starting point. Can anyone help fill in the gaps?

Here's what I want to do:

Tune the hyperparameters (the L1 and L2 regularisation parameters) of a glmnet model using grid search or random grid search (or anything faster if it exists: iterated F-racing? adaptive resampling?) with a stratified k-fold cross-validation inner loop, and an outer cross-validation loop to assess the expected final performance. I want to include a feature preprocessing step in the inner loop with centering, scaling, and a Yeo-Johnson transformation, plus fast filter-based feature selection (the latter is a necessity because I have very modest hardware and need to slim the feature space to reduce training time). I have imbalanced classes (the positive class is about 20%), so I have opted to use AUC as my optimisation objective, but this is only a surrogate for the real metric of interest, which is the false positive rate at a small number of fixed true positive rates (i.e., I want to know the FPR for TPR = 0.6, 0.7 and 0.8). I'd like to tune the probability thresholds to achieve those TPRs; this appears to be possible in nested CV (see https://github.com/mlr-org/mlr/issues/856), but it's not clear to me exactly what is being optimised there. I'd like to know where the cut should be placed without incurring information leakage, so I want to pick it using CV.
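
For the FPR-at-fixed-TPR metric, this is roughly the kind of custom measure I have in mind, built with makeMeasure (an untested sketch; the id fpr.at.tpr60 is just a placeholder name of mine):

# Untested sketch: report the FPR at the highest threshold whose TPR is at least 0.6.
fpr.at.tpr60 <- makeMeasure(
  id = "fpr.at.tpr60", minimize = TRUE, best = 0, worst = 1,
  properties = c("classif", "req.pred", "req.truth", "req.prob"),
  fun = function(task, model, pred, feats, extra.args) {
    prob  <- getPredictionProbabilities(pred)                # P(class == positive)
    truth <- getPredictionTruth(pred) == pred$task.desc$positive
    ord   <- order(prob, decreasing = TRUE)                  # sweep the threshold from high to low
    tpr   <- cumsum(truth[ord]) / sum(truth)
    fpr   <- cumsum(!truth[ord]) / sum(!truth)
    fpr[which(tpr >= 0.6)[1]]                                # FPR at the first cut where TPR >= 0.6
  }
)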

I'm using glmnet because I'd rather spend my CPU cycles on building a robust model than on a fancy model that produces over-optimistic results. GBM or random forest can come later if I find they can be trained fast enough, but I don't expect the features in my data to be informative enough to justify investing much time in training anything particularly complex.

Finally, after I've obtained an estimate of the performance I can expect from the final model, I want to actually build that model and obtain its glmnet coefficients, including which ones are zero, so I know which features have been selected by the LASSO penalty.
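
I imagine that last step looking something like this (untested sketch; tuningLrn is the tuned learner I set up further down, and lambda.final is a placeholder for whatever value of s the tuning settles on):

# Untested sketch: refit on the full dataset and pull out the glmnet coefficients.
finalModel <- train(tuningLrn, task)
glmnetFit  <- getLearnerModel(finalModel, more.unwrap = TRUE)  # unwrap down to the raw glmnet fit
coef(glmnetFit, s = lambda.final)                              # lambda.final is a placeholder; zero coefficients are the features dropped by the LASSO penalty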

Hope all this makes sense!

Here's what I've got so far:

# Convert my data (DT) to a data.frame, which is what makeClassifTask expects:
df <- as.data.frame(DT)

task <- makeClassifTask(id = "glmnet", 
                        data = df, 
                        target = "Flavour", 
                        positive = "quark")
task


lrn <- makeLearner("classif.glmnet", predict.type = "prob")
lrn

# Feature preprocessing -- want to do this as part of CV:
lrn <- makePreprocWrapperCaret(lrn,
                               ppc.center = TRUE, 
                               ppc.scale = TRUE,
                               ppc.YeoJohnson = TRUE)
lrn

# I want to use the implementation of info gain in CORElearn, not Weka:
infGain = makeFilter(
  name = "InfGain",
  desc = "Information gain ",
  pkg  = "CORElearn",
  supported.tasks = c("classif", "regr"),
  supported.features = c("numerics", "factors"),
  fun = function(task, nselect, ...) {
    CORElearn::attrEval(
      getTaskFormula(task), 
      data = getTaskData(task), estimator = "InfGain", ...)
  }
)
infGain

# Take top 20 features:
lrn <-  makeFilterWrapper(lrn, fw.method = "InfGain", fw.abs = 20)
lrn

# Now things start to get foggy...

# Inner CV loop: tune the glmnet hyperparameters s (the lambda used at prediction time) and alpha by grid search:
tuningLrn <- makeTuneWrapper(
  lrn, 
  resampling = makeResampleDesc("CV", iters = 2,  stratify = TRUE), 
  par.set = makeParamSet(
    makeNumericParam("s", lower = 0.001, upper = 0.1),
    makeNumericParam("alpha", lower = 0.0, upper = 1.0)
  ), 
  control = makeTuneControlGrid(resolution = 2)
)

# Outer CV loop (stratified, 5 folds) to estimate the performance of the whole tuning procedure:
rdesc <- makeResampleDesc("CV", iters = 5, stratify = TRUE)

r2 <- resample(learner = tuningLrn, 
               task = task, 
               resampling = rdesc, 
               measures = auc)
# Now what...?
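
Presumably the next step is to inspect the nested CV result, something like this (guessing; untested):

r2$aggr                             # aggregated outer-loop AUC estimate
r2$extract                          # per-fold tuning results (I think this needs extract = getTuneResult in resample?)
getNestedTuneResultsOptPathDf(r2)   # parameter settings evaluated in the inner loops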
  • It looks like you have all the pieces in place. It sounds like you would want a [custom measure](https://mlr-org.github.io/mlr-tutorial/release/html/create_measure/index.html), but apart from that you should have all the information. The parameters for the best models are part of the resample return value (`r2$extract`, see [here](https://mlr-org.github.io/mlr-tutorial/release/html/nested_resampling/index.html)), and will also give you an estimate of the performance. I would start with random search instead of grid search. Are you saying that the code you've posted doesn't work for you? – Lars Kotthoff Aug 29 '16 at 21:02
  • Hi Lars, thanks for the reply! In answer to your question ("Are you saying that the code you've posted doesn't work for you?"), I'm not sure. I want to train using AUC as the optimisation objective, and I have specified this as the value of the argument "measures" in the resample function, but during training I see the following output: `[Tune-y] 1: mmce.test.mean=0.179; time: 4.5 min; memory: 222Mb use, 1440Mb max`. This is not what I was expecting. The output seems to indicate that the optimisation objective is misclassification error and not AUC. I'm uncertain if AUC is being used. – Dr. Andrew John Lowe Aug 31 '16 at 14:24
  • Specific questions: 1. How do I tune the probability threshold to achieve a specified TPR? Github issue 856 contains a script, but I don't see what is being optimised. 2. The resample return value (r2$extract above) is NULL for entries [[1]] to [[5]]. No parameters or performance estimate. How to get these? I still don't know how to build the final model and get coefficients. [This post](http://stats.stackexchange.com/questions/65128/nested-cross-validation-for-model-selection) suggests performing model selection on the whole dataset to get the final model. What's the correct procedure? – Dr. Andrew John Lowe Aug 31 '16 at 14:38
  • You need to set `auc` for the inner resampling as well. 1. Like I said, I would do this with a custom measure. 2. This could be because of errors in the runs (too much memory?). If your tuned parameter sets are very similar, you can simply choose one of them. – Lars Kotthoff Aug 31 '16 at 18:42
  • Thanks! 1. I'm thinking tuneParamsMultiCrit might be helpful, with two measures: fpr and "tpr60", where the latter is a custom measure equal to abs(tpr - 0.6). This would minimise the fpr for a target tpr of 60%. But I haven't been able to get this to work in nested CV. 2. Fixed this: I needed `extract = getTuneResult` in the resample function. I'm getting close. Basically, I'm missing the method for tuning the probability threshold for a target tpr, and I'd like to speed things up a bit. I tried irace: it crashes in nested CV. I haven't tried cma_es or GenSA, so I don't know whether they're faster or will work. – Dr. Andrew John Lowe Aug 31 '16 at 19:04
  • RE: building the final model. My understanding, based on [this post](http://stats.stackexchange.com/questions/65128/nested-cross-validation-for-model-selection) and similar posts on SO, is that nested CV will tell me the estimated performance I can expect of the final model if I apply the same *method* that I used in the inner CV loop in order to build the final model, i.e., I understand I perform hyperparam tuning on the full dataset as I did in the inner CV to get the final model. This is contrary to my previous understanding that the model is built on *fixed* best params learned from CV. Please confirm! – Dr. Andrew John Lowe Aug 31 '16 at 19:11
  • The same *method* here means the same model (including hyperparameters). It *doesn't* mean tuning -- this is an iterative process that evaluates multiple times. I think a single measure (a combination of the two measures you propose) is the way to go here. Have you tried random sampling? You can always reduce the number of iterations to make it faster. – Lars Kotthoff Aug 31 '16 at 20:08
  • Still confused about getting the final model -- sorry! If I tune the L1 and L2 hyperparams of glmnet in nested CV, then I'll have as many sets of hyperparams as there are outer loops in the nested CV -- not a single set that I fix to build the final model. Testing a single point in param space is already slow, so although I expect RGS will be faster, it'll still be very slow on my hardware. I'm thinking of using your FSelector package to slim the feature space, but I'm aware that perf may be sub-optimal wrt LASSO regularisation. Can't figure out a single measure for the fpr+tpr combination, so will report ROC instead. – Dr. Andrew John Lowe Sep 01 '16 at 15:45
  • I would have a look at the result first -- like I said, the values of the parameters may all be very close. Otherwise you can tune on the entire dataset (with a CV). Do you get much better results when you tune for longer? Otherwise, simply reduce the number of points. – Lars Kotthoff Sep 01 '16 at 18:04
  • There's quite a bit of variability in the hyperparameter values, which I suppose further underscores the need for nested CV. "Otherwise you can tune on the entire dataset (with a CV)" -- sounds like what I had in mind earlier, i.e., replicate the complete procedure (with tuning) in the inner CV of the earlier nested CV on the entire dataset in order to build the final model. Is this the method you propose for building the final model? – Dr. Andrew John Lowe Sep 02 '16 at 11:51
  • This is how I would determine the parameters for building the final model. It may help if you describe the use case in more detail. – Lars Kotthoff Sep 02 '16 at 17:07
  • Thanks! Then I think I understand now how to build the final model. And I also know now how to tune the probability threshold using `tune.threshold = TRUE` in a `makeTuneControl*` function. I'm following your suggestion of making a single custom measure to minimise the FPR for a specified TPR, but I don't know how to proceed. Not sure what would be an ideal measure. This appears to be the final hurdle. See here: https://github.com/mlr-org/mlr/issues/856#issuecomment-245526617 – Dr. Andrew John Lowe Sep 08 '16 at 16:57
  • For the outer CV evaluation hyperparameters, why not take the average? If there is variability in these parameters, then the number of inner-loop iterations needs to be increased. – sophie-germain Jul 12 '17 at 18:48
