I'm trying to use SuperLearner, but no matter which algorithms I add to the library, it only ever returns a discrete winner with a coefficient of 1. Is there an option to prevent that from happening?

Code:

library(SuperLearner)

Call:  
SuperLearner(Y = msicudatatrain$IsDeceased, X = x, family = binomial(), 
SL.library = c("SL.mean", "SL.glmnet",  
    "SL.ksvm", "SL.rpart"), verbose = TRUE) 


                      Risk Coef
SL.mean_All   1.684285e-01    0
SL.glmnet_All 4.483909e-07    0
SL.ksvm_All   1.750231e-03    0
SL.rpart_All  0.000000e+00    1

Now, excluding rpart, the same thing happens:

Call:  
SuperLearner(Y = msicudatatrain$IsDeceased, X = x, family = binomial(), 
SL.library = c("SL.mean", "SL.glmnet",  
    "SL.ksvm"), verbose = TRUE) 


                      Risk Coef
SL.mean_All   1.683833e-01    0
SL.glmnet_All 4.482701e-07    1
SL.ksvm_All   1.989397e-03    0

If I try a continuous Y variable (in this case, hospital length of stay), it also gives a discrete winner, which seems counterintuitive.

Call:  
SuperLearner(Y = msicudatatrain$ICU_LOS_Clinical, X = x, family = gaussian(), 
SL.library = c("SL.mean", "SL.glmnet",  
    "SL.ksvm", "SL.randomForest", "SL.rpart"), verbose = TRUE) 


                           Risk Coef
SL.mean_All         51.59664196    0
SL.glmnet_All        0.05281076    1
SL.ksvm_All          2.69611753    0
SL.randomForest_All  2.00135683    0
SL.rpart_All         1.38172213    0

What should I do?

FFR

1 Answer

From your first result:

It shows that rpart has zero risk (error), so it is the clear winner; giving any weight to the other learners would only increase the error of the combined prediction.

Similarly, from your second result:

It shows that glmnet has a risk that is orders of magnitude smaller than that of the other two learners.

It seems that either your classes are very well separated (a risk of 0 from rpart) or there is a modelling error. I would suggest running the classification models individually and checking how they perform (i.e. comparing their prediction errors).
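As a rough illustration of checking one learner on its own, here is a minimal base-R sketch. It uses simulated data (the `toy` data frame and its columns `x1`, `x2`, `y` are made up for this example, not your ICU data) and a plain `glm()` fit rather than any of the SuperLearner wrappers. It compares held-out squared error against a mean-only baseline, on the assumption that squared error is what the Risk column reports.

```r
# Simulated binary outcome; all names here are hypothetical.
set.seed(1)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(1.5 * toy$x1 - toy$x2))

# Hold out a quarter of the rows so the error is not measured on training data.
train_idx <- sample(n, size = 150)
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Fit a single learner by itself (logistic regression here).
fit   <- glm(y ~ x1 + x2, data = train, family = binomial())
p_hat <- predict(fit, newdata = test, type = "response")

# Held-out squared error of the model vs. a mean-only baseline.
mse_model <- mean((test$y - p_hat)^2)
mse_mean  <- mean((test$y - mean(train$y))^2)
print(c(model = mse_model, baseline = mse_mean))
```

If a learner's held-out error on your real data is essentially zero, as in your tables, it is worth checking whether one of the columns of `x` encodes the outcome before trusting the fit.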

Modelling error: One possibility is forgetting to convert factors into binary features before passing them to SuperLearner. SuperLearner provides wrappers around algorithms that are already implemented in R. A given algorithm may or may not handle factors directly, so you need to convert factor/categorical features into binary (0/1) features before passing them to SuperLearner.

Refer to the official R guide for data pre-processing and usage of SuperLearner: Dataset Pre-processing for SuperLearner. It mentions: "If we had factor variables we would use model.matrix() to convert to numerics."
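A minimal sketch of that conversion with base R's `model.matrix()` is below. The data frame `df` and its columns `age` and `unit` are invented for illustration; substitute your own predictors.

```r
# Hypothetical predictors: one numeric column and one factor.
df <- data.frame(
  age  = c(54, 61, 47, 70),
  unit = factor(c("medical", "surgical", "medical", "cardiac"))
)

# Expand factors into 0/1 indicator columns. The first factor level
# ("cardiac") becomes the reference level, and the intercept column
# that model.matrix() adds is dropped.
x_mat <- model.matrix(~ ., data = df)[, -1, drop = FALSE]

# Convert back to a data frame, which is the usual form for the X argument.
x_num <- as.data.frame(x_mat)
print(x_num)
```

The resulting `x_num` contains only numeric columns (`age`, `unitmedical`, `unitsurgical`) and can then be passed as `X` to `SuperLearner()`.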

Mankind_008
  • Thanks @Mankind_08, this also happens when I use SuperLearner for regression (in my example, predicting patient length of stay). This seems odd to me. – FFR Jun 22 '18 at 14:46
  • You will need to provide a reproducible example of your data and code so that I or others can help further. See this [post](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – Mankind_008 Jun 22 '18 at 18:32
  • Thanks @Mankind_08. I found a modelling error and it seems to work now. Quick question: can SuperLearner handle categorical variables? Thanks. – FFR Jun 23 '18 at 19:00
  • In short: SuperLearner provides wrappers around algorithms that are already implemented in R. A given algorithm may or may not handle factors directly, so you need to convert factor/categorical features into binary features before passing them to SuperLearner. Also, if this answer helped you solve the issue, don't forget to upvote/accept it so others can benefit as well. – Mankind_008 Jun 23 '18 at 19:34