After creating the model with h2o.randomForest, I evaluate it on a test frame:
perf <- h2o.performance(model, test)
print(perf)
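For context, the model-building step looks roughly like this (a sketch only; the frame name, response column, and ntrees value are placeholders, not my actual code):

library(h2o)
h2o.init()

# 'train' is an H2OFrame and "outcome" its binary response column (placeholder names)
train$outcome <- as.factor(train$outcome)          # factor response => classification
predictors    <- setdiff(names(train), "outcome")  # all other columns as predictors

model <- h2o.randomForest(x = predictors,
                          y = "outcome",
                          training_frame = train,
                          ntrees = 50)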
Printing perf gives the following output (an H2OModelMetrics object):
H2OBinomialMetrics: drf
MSE: 0.1353948
RMSE: 0.3679604
LogLoss: 0.4639761
Mean Per-Class Error: 0.3733908
AUC: 0.6681437
Gini: 0.3362873
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
          0    1    Error       Rate
0      2109 1008 0.323388 =1008/3117
1       257  350 0.423394   =257/607
Totals 2366 1358 0.339689 =1265/3724
Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold    value idx
1                       max f1  0.080124 0.356234 248
2                       max f2  0.038274 0.515566 330
3                 max f0point5  0.173215 0.330006 131
4                 max accuracy  0.288168 0.839957  64
5                max precision  0.941437 1.000000   0
6                   max recall  0.002550 1.000000 397
7              max specificity  0.941437 1.000000   0
8             max absolute_mcc  0.113838 0.201161 195
9   max min_per_class_accuracy  0.071985 0.621087 262
10 max mean_per_class_accuracy  0.078341 0.626921 251
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)`
or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
I usually look at sensitivity (recall) and specificity to compare the quality of my prediction models, but I cannot read those metrics directly from the output above. Based on this output, how can I evaluate the quality of my predictions?
If I compute those metrics from the confusion matrix myself I get sens = 0.58 and spec = 0.68, which differs from the values reported above (see the sketch below).
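For clarity, this is the computation I mean, taking the counts from the confusion matrix above and treating class 1 as the positive class:

# Counts from the confusion matrix above (class 1 = positive)
tp <- 350    # actual 1, predicted 1
fn <- 257    # actual 1, predicted 0
tn <- 2109   # actual 0, predicted 0
fp <- 1008   # actual 0, predicted 1

sens <- tp / (tp + fn)   # 350 / 607   ~ 0.58 (sensitivity / recall)
spec <- tn / (tn + fp)   # 2109 / 3117 ~ 0.68 (specificity)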
Is there a way to get such values directly, like the summary that confusionMatrix from the caret package provides?
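For comparison, this is the kind of summary I have in mind from caret (a sketch; pred and obs are hypothetical factor vectors of predicted and actual classes):

library(caret)

# pred / obs: hypothetical factors of predicted and observed classes
confusionMatrix(data = pred, reference = obs, positive = "1")
# reports Accuracy, Kappa, Sensitivity, Specificity, etc. in one summary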
For me, such metrics are more intuitive than the logLoss metric.