I am training a model to predict the label (target) based on loan status e.g. 0,1,2,3. So i have 4 classes. I have so far trained a model as follows:
from HyperclassifierSearch import HyperclassifierSearch
X = data.iloc[:, :-1]
y = data.label
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2,
random_state=42)
# Create a hold out dataset to train the calibrated model to prevent overfitting
X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train,
stratify=y_train, test_size=0.2, random_state=42)
categorical_transformer = OneHotEncoder(handle_unknown='ignore')
numeric_transformer = Pipeline(steps=[('imputer',SimpleImputer(missing_values=np.nan, fill_value=0) ),('scaler', StandardScaler())])
preprocessor = ColumnTransformer(transformers=[('num', numeric_transformer, numeric_cols),
('cat', categorical_transformer, cat_cols)])
#then i use hyperclassifer library
models = { 'xgb': Pipeline(steps=[('preprocessor', preprocessor),('clf', XGBClassifier(objective='multi:softprob'))]),
'rf': Pipeline(steps=[('preprocessor', preprocessor),('clf', RandomForestClassifier(criterion = 'entropy', random_state = 42))]) }
search = HyperclassifierSearch(models, params)
best_grid = search.train_model(X_train, y_train, cv=3, n_jobs=-1, scoring='accuracy')
results = search.evaluate_model()
fitted_model = best_grid.best_estimator_
pred = fitted_model.predict_proba(X_test)
labels = fitted_model.predict(X_test)
**note i have omitted alot of the imported libs and params dict since large and only included hyperclassifier since it is large **
my pred is a matrix containing 4 columns where each is related to the class of the loan. Generally i know it is good pracitce to calibrate the probabilities and particularly from tree based algorithms the output is a score not really a probability. I am however confused as to how to calibrate these probabilities.
Usually i would calibrate using the holdout validation set but am unsure how to do it with multiclass
Update
Should i ammend the above xgbclassifier by doing the following:
OneVsRestClassifier(CalibratedClassifierCV(XGBClassifier(objective='multi:softprob'), cv=10))
source: Multiclass linear SVM in python that return probability