5

In my problem dataset, the response variable is extremely skewed to the left. I have tried to fit models with h2o.randomForest() and h2o.gbm() as shown below. I can tune min_split_improvement and min_rows to avoid overfitting in these two cases, but I still see very high errors on the tail observations. I have also tried using weights_column to oversample the tail observations and undersample the others, but it did not help (a sketch of what I tried appears after the code below).

h2o.model <- h2o.gbm(x = predictors, y = response,
                     training_frame = train, validation_frame = valid,
                     seed = 1, ntrees = 150, max_depth = 10, min_rows = 2,
                     model_id = "GBM_DD", balance_classes = TRUE, nbins = 20,
                     stopping_metric = "MSE", stopping_rounds = 10,
                     min_split_improvement = 0.0005)


h2o.model <- h2o.randomForest(x = predictors, y = response,
                              training_frame = train, validation_frame = valid,
                              seed = 1, ntrees = 150, max_depth = 10, min_rows = 2,
                              model_id = "DRF_DD", balance_classes = TRUE, nbins = 20,
                              stopping_metric = "MSE", stopping_rounds = 10,
                              min_split_improvement = 0.0005)
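
For completeness, this is roughly how I built the weights (a sketch; the 0.95 quantile cutoff and the weight of 10 are placeholders for the values I tried):

cutoff <- h2o.quantile(train[, response], probs = 0.95)        # tail threshold (placeholder)
train[, "wt"] <- h2o.ifelse(train[, response] > cutoff, 10, 1)  # upweight tail rows
# then pass weights_column = "wt" to h2o.gbm()/h2o.randomForest()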

To get better performance, I have also tried the h2o.automl() function from the h2o package. However, I see significant overfitting there, and I don't know of any parameters in h2o.automl() to control it.
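
For reference, the AutoML call I am using is essentially this (a minimal sketch; the runtime budget is illustrative):

aml <- h2o.automl(x = predictors, y = response,
                  training_frame = train, validation_frame = valid,
                  max_runtime_secs = 600, seed = 1)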

Does anyone know of a way to avoid overfitting with h2o.automl()?

EDIT

Following Erin's suggestion, I log-transformed the response; its distribution is given below.

[Histogram of the log-transformed response]

EDIT2: Distribution of the original response.

[Histogram of the original response]

deepAgrawal
  • Perhaps try to transform the problematic features; caret has a useful function `BoxCoxTrans` that can help with skewness. – missuse Jan 18 '18 at 21:06
  • It looks like a Poisson distribution, so I would either use a linear model where I specify the distribution, or try boosting, which will handle this. Here is what boosting does: https://i2.wp.com/freakonometrics.hypotheses.org/files/2015/07/boosting-algo-0_v2.gif?w=456&ssl=1 – Esben Eickhardt Jan 25 '18 at 10:05

2 Answers

13

H2O AutoML uses H2O algos (e.g. RF, GBM) underneath, so if you're not able to get good models there, you will suffer from the same issues using AutoML. I am not sure that I would call this overfitting -- it's more that your models are not doing well at predicting outliers.

My recommendation is to log your response variable -- that's a useful thing to do when you have a skewed response. In the future, H2O AutoML will try to detect a skewed response automatically and take the log, but that's not a feature of the current version (H2O 3.16.*).

Here's a bit more detail if you are not familiar with this process. First, create a new column, e.g. log_response, as follows and use that as the response when training (in RF, GBM or AutoML):

train[,"log_response"] <- h2o.log(train[,response])

Caveats: If you have zeros in your response, use h2o.log1p() instead. Also make sure not to include the original response among your predictors; in your case nothing needs to change, because you are already specifying the predictors explicitly via a predictors vector.
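
For example, a minimal sketch of the zero-safe variant (note that the inverse of log1p is exp(x) - 1):

train[,"log_response"] <- h2o.log1p(train[,response])   # log(1 + y), safe when y contains zeros
model <- h2o.gbm(x = predictors, y = "log_response",
                 training_frame = train, validation_frame = valid)
log_pred <- h2o.predict(model, test)
pred <- h2o.exp(log_pred) - 1                           # exp(x) - 1 undoes log1p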

Keep in mind that when you log the response, your predictions and model metrics will be on the log scale, so you may need to convert your predictions back to the original scale, like this:

model <- h2o.randomForest(x = predictors, y = "log_response",
                          training_frame = train, validation_frame = valid)
log_pred <- h2o.predict(model, test)  # predictions on the log scale
pred <- h2o.exp(log_pred)             # back-transform to the original scale

This gives you the predictions, but if you also want metrics on the original scale, you will have to compute them with the h2o.make_metrics() function using the new pred, rather than extracting the metrics from the model:

perf <- h2o.make_metrics(predicted = pred, actuals = test[,response])  # metrics on the original scale
h2o.mse(perf)
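
h2o.make_metrics() returns a standard H2O metrics object, so the usual accessors (e.g. h2o.mse(), h2o.rmse(), h2o.mae()) all work on it.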

You can try this with RF as I showed above, with a GBM, or with AutoML (which should give better performance than a single RF or GBM), as in the sketch below.
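
For example, the same idea with AutoML (a sketch; max_runtime_secs is an illustrative budget):

aml <- h2o.automl(x = predictors, y = "log_response",
                  training_frame = train, validation_frame = valid,
                  max_runtime_secs = 600, seed = 1)
log_pred <- h2o.predict(aml@leader, test)  # predictions from the leader model, on the log scale
pred <- h2o.exp(log_pred)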

Hopefully that helps improve the performance of your models!

Erin LeDell
  • I thought transforming the data would not help with ensemble models, since they are nonlinear and can capture these patterns. The performance does improve on the tail side, although not as much as I would like. I will also play around with the weight feature to see if that helps; maybe I did not use a high enough weight. – deepAgrawal Jan 19 '18 at 22:33
  • You are transforming the target response, not the features (very different things). After logging the response, does the distribution look a bit more normal? – Erin LeDell Jan 19 '18 at 22:49
  • It is less skewed, but it still does not look normal. I have updated my original post with the transformed distribution. – deepAgrawal Jan 19 '18 at 23:04
  • Yep, that's still relatively skewed. You must have really big outliers in the original response. I'd be interested in seeing the distribution of the original response... – Erin LeDell Jan 22 '18 at 01:08
  • Hi Erin, sorry for the delay. I just updated my post with the original response distribution. – deepAgrawal Jan 22 '18 at 16:02
0

When your target variable is skewed, MSE is not a good metric to use. I would try changing the loss function, because GBM fits the model to the gradient of the loss function, so you want to make sure you are using the correct distribution. If you have a spike at zero and a right-skewed positive target, Tweedie would probably be a better option.
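
For example, a minimal sketch using the GBM interface from the question (the tweedie_power of 1.5 is just an illustrative starting point; it should lie in (1, 2) for a zero-inflated positive response):

h2o.model <- h2o.gbm(x = predictors, y = response,
                     training_frame = train, validation_frame = valid,
                     distribution = "tweedie", tweedie_power = 1.5,
                     seed = 1)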

Rio