how to define threshold value upon model training?
There is simply no threshold during model training; Random Forest is a probabilistic classifier, and it only outputs class probabilities. "Hard" classes (i.e. 0/1), which indeed require a threshold, are neither produced nor used in any stage of the model training - only during prediction, and even then only in the cases we indeed require a hard classification (not always the case). Please see Predict classes or class probabilities? for more details.
In fact, the scikit-learn implementation of RF doesn't employ a threshold at all, even for hard class prediction; reading the docs for the predict method closely:
the predicted class is the one with highest mean probability estimate across the trees
In simple words, this means that the actual RF output is [p0, p1] (assuming binary classification), from which the predict method simply returns the class with the highest value, i.e. 0 if p0 > p1 and 1 otherwise.
Assuming that what you actually want is to return 1 if p1 is greater than some threshold less than 0.5, you have to ditch predict, use predict_proba instead, and then manipulate the returned probabilities to get what you want. Here is an example with dummy data:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Dummy binary classification data
X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           n_classes=2, random_state=0, shuffle=False)

clf = RandomForestClassifier(n_estimators=100, max_depth=2,
                             random_state=0)
clf.fit(X, y)
Here, simply using predict for, say, the first element of X will give 0:
clf.predict(X)[0]
# 0
because
clf.predict_proba(X)[0]
# array([0.85266881, 0.14733119])
i.e. p0 > p1.
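The relationship between predict and predict_proba described above can be checked directly. The following is a minimal sketch that rebuilds the same dummy data and model so it runs on its own; the variable names argmax_preds and same are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Same dummy data and model as in the example above
X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           n_classes=2, random_state=0, shuffle=False)
clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
clf.fit(X, y)

# Averaged class probabilities, one row [p0, p1] per sample
proba = clf.predict_proba(X)

# Pick the class with the highest probability in each row
argmax_preds = clf.classes_[np.argmax(proba, axis=1)]

# Compare against the hard predictions returned by predict
same = bool(np.array_equal(argmax_preds, clf.predict(X)))
print(same)
```

If the comparison holds, no threshold is involved in predict; it is only a row-wise argmax over the averaged probabilities.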
To get what you want (i.e. here returning class 1, since p1 > threshold for a threshold of 0.11), here is what you have to do:
prob_preds = clf.predict_proba(X)  # one row [p0, p1] per sample
threshold = 0.11  # define threshold here
# predict class 1 whenever p1 exceeds the threshold, class 0 otherwise
preds = [1 if p[1] > threshold else 0 for p in prob_preds]
After this, the first predicted sample is now classified as 1:
preds[0]
# 1
since, as shown above, for this sample we have p1 = 0.14733119 > threshold.
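For larger datasets, the same thresholding can be done without a Python loop. Below is a minimal sketch, assuming a fitted binary classifier; predict_with_threshold is a hypothetical helper written for this answer, not part of scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def predict_with_threshold(model, X, threshold=0.5):
    """Hypothetical helper (not a scikit-learn function).

    Returns the second class in model.classes_ when its predicted
    probability exceeds `threshold`, and the first class otherwise.
    Assumes a fitted binary classifier.
    """
    p1 = model.predict_proba(X)[:, 1]  # probability of the second class
    return np.where(p1 > threshold, model.classes_[1], model.classes_[0])


# Same dummy data and model as in the example above
X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           n_classes=2, random_state=0, shuffle=False)
clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
clf.fit(X, y)

preds_low = predict_with_threshold(clf, X, threshold=0.11)
preds_default = predict_with_threshold(clf, X, threshold=0.5)

# Lowering the threshold can only add predictions of class 1
print(int(preds_low.sum()), int(preds_default.sum()))
```

Note that the threshold is applied only at prediction time; the fitted model itself is unchanged, so several thresholds can be tried on the same clf without retraining.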