According to the H2O documentation, the threshold used at prediction time is the max-F1 threshold from training. The performance function,
h2o.performance(model, newdata = test)
actually runs the prediction on the test set in order to compute the confusion matrix.
Strangely, I am getting a different confusion matrix when predicting on the same test set with:
h2o.predict(object, newdata = test)
This suggests that h2o.performance() is using a different threshold than h2o.predict().
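
For reference, this is roughly how I am comparing the two results (model, test, and the response column name are placeholders for my actual objects):

library(h2o)
h2o.init()

# Confusion matrix reported by h2o.performance() (max-F1 threshold by default)
perf <- h2o.performance(model, newdata = test)
h2o.confusionMatrix(perf)

# Confusion matrix implied by the hard labels returned by h2o.predict()
pred <- as.data.frame(h2o.predict(model, newdata = test))
actual <- as.data.frame(test)$response   # placeholder response column name
table(actual, pred$predict)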
I am wondering how I can dictate the threshold at prediction time.
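
The only workaround I can see is to ignore the predict column and apply my own threshold to the p1 probability, along these lines (my_threshold and the class labels "1"/"0" are just placeholders):

# Pick a threshold: a fixed value, or the max-F1 threshold from the test metrics
my_threshold <- h2o.find_threshold_by_max_metric(perf, "f1")   # or e.g. 0.5

pred <- as.data.frame(h2o.predict(model, newdata = test))
pred$label <- ifelse(pred$p1 >= my_threshold, "1", "0")   # placeholder class labels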