
I have created an H2O random forest model for fraud prediction. While scoring the test data with the predict function, I got the dataframe below as output.

Now, for the 2nd record it predicted 1, even though the probability p1 is far smaller than p0. What are the correct probability scores (p0/p1) and classification to use for my fraud prediction model?

If these are not correct probabilities, will the calibrated probabilities produced by setting `calibrate_model = True` (as shown below) give correct probabilities?

    nfolds = 5
    rf1 = h2o.estimators.H2ORandomForestEstimator(
        model_id = "rf_df1",
        ntrees = 200,
        max_depth = 4,
        sample_rate = .30,
        # stopping_metric = "misclassification",
        # stopping_rounds = 2,
        mtries = 6,
        min_rows = 12,
        nfolds = 3,
        distribution = "multinomial",
        fold_assignment = "Modulo",
        keep_cross_validation_predictions = True,
        calibrate_model = True,
        calibration_frame = calib,
        weights_column = "weight",
        balance_classes = True
        # stopping_tolerance = .005
    )

        predict   p0          p1
    1   0         0.9986012   0.000896514
    2   1         0.9985695   0.000448676
    3   0         0.9981387   0.000477767
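For context on the classification column: for binomial models H2O assigns the label by comparing p1 to a per-model threshold, not by comparing p1 to p0 or to 0.5. A minimal sketch of that rule, with an invented threshold value:

```python
# Sketch of threshold-based labeling for a binomial model: the predicted
# class is 1 whenever p1 >= threshold, however small p1 is in absolute terms.
# The threshold value here is made up for illustration.
def label(p1, threshold):
    """Return the predicted class given the positive-class probability."""
    return 1 if p1 >= threshold else 0

threshold = 0.0005  # hypothetical F1-maximizing threshold
print(label(0.00044, threshold))  # below the threshold -> 0
print(label(0.00090, threshold))  # above the threshold -> 1
```

This is why a row can receive label 1 even though p1 is tiny: on a heavily imbalanced fraud dataset, the optimal threshold can be far below 0.5.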
Rob

1 Answer


The prediction labels are based on a threshold, and the threshold used is generally the one that maximizes the F1 score. See the following post to learn more about how to interpret the probability results.
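To make "the threshold that maximizes F1" concrete, here is a self-contained sketch that evaluates F1 at each candidate threshold and keeps the best one; the scores and labels are made up. (In H2O's Python API, the chosen threshold can also be read from the model's performance metrics, e.g. via `find_threshold_by_max_metric("f1")`.)

```python
# Sketch: pick the classification threshold that maximizes F1 on held-out data.
def f1_at(threshold, probs, labels):
    """F1 score when predicting 1 for every probability >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

probs  = [0.05, 0.20, 0.35, 0.60, 0.80]  # made-up p1 values
labels = [0,    0,    1,    1,    1]     # made-up true classes
best = max(probs, key=lambda t: f1_at(t, probs, labels))
print(best)  # -> 0.35: predicting 1 for p >= 0.35 separates the classes perfectly here
```

Note that nothing forces this threshold to be near 0.5, which is how rows with very small p1 can still be labeled 1.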

Details on how the calibration frame and model work can be found here and here.
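For intuition about what calibration does: H2O's `calibrate_model = True` performs Platt scaling, fitting a logistic curve that maps the model's raw scores to probabilities using the true labels from the calibration frame. A self-contained sketch of that idea, using plain gradient descent on made-up data rather than H2O's internal GLM:

```python
import math

# Sketch of Platt scaling: fit p_cal = sigmoid(a * score + b) on a held-out
# calibration set by minimizing log loss with gradient descent.
# The scores and labels below are made up.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def platt_fit(scores, labels, lr=0.5, steps=5000):
    """Return (a, b) for the calibration map sigmoid(a * score + b)."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        ga = gb = 0.0
        for s, y in zip(scores, labels):
            p = sigmoid(a * s + b)
            ga += (p - y) * s / n   # gradient of log loss w.r.t. a
            gb += (p - y) / n       # gradient of log loss w.r.t. b
        a -= lr * ga
        b -= lr * gb
    return a, b

scores = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]  # raw model scores (p1)
labels = [0,   0,   1,   0,   1,   1]    # outcomes on the calibration frame
a, b = platt_fit(scores, labels)
calibrated = [sigmoid(a * s + b) for s in scores]
```

The calibrated values preserve the ranking of the raw scores but are rescaled so that they behave like real probabilities on data resembling the calibration frame; the raw p0/p1 ranking (and hence the predicted labels) is unaffected.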

Lauren