I would like to run K-fold cross-validation on my data using a classifier, and write the prediction (or predicted-probability) columns for each sample directly back into the original dataset/dataframe. Any ideas?
from sklearn.metrics import accuracy_score, roc_auc_score
import pandas as pd
from sklearn.model_selection import KFold

k = 5
kf = KFold(n_splits=k, random_state=None)
acc_score = []
auroc_score = []
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(X_train, y_train)
    pred_values = model.predict(X_test)
    predict_prob = model.predict_proba(X_test)[:, 1]
    auroc = roc_auc_score(y_test, predict_prob)
    acc = accuracy_score(y_test, pred_values)
    auroc_score.append(auroc)
    acc_score.append(acc)
avg_acc_score = sum(acc_score) / k
print('accuracy of each fold - {}'.format(acc_score))
print('Avg accuracy : {}'.format(avg_acc_score))
print('AUROC of each fold - {}'.format(auroc_score))
print('Avg AUROC : {}'.format(sum(auroc_score) / k))
Given this code, how could I add a prediction column, or even better, prediction-probability columns, for each sample to the initial dataset?
In 10-fold cross-validation, each example (sample) is used exactly once in a test set and 9 times in a training set. So, after 10-fold cross-validation, the result should be a dataframe with the predicted class for ALL examples in the dataset. Each example would keep its initial features and its labelled class, plus the class predicted in the cross-validation fold where that example appeared in the test set.
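One way to achieve this is to pre-allocate prediction columns in the dataframe and fill them inside the fold loop, using `test_index` to write each fold's out-of-fold predictions back to the right rows. Here is a minimal self-contained sketch; the synthetic data, `LogisticRegression` model, and the column names `pred` and `pred_proba_1` are illustrative placeholders, not from the original post:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Illustrative dataset: 100 samples, 4 features, binary labels.
X_arr, y_arr = make_classification(n_samples=100, n_features=4, random_state=0)
df = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(4)])
df["label"] = y_arr

feature_cols = [f"f{i}" for i in range(4)]
X = df[feature_cols]
y = df["label"]

model = LogisticRegression(max_iter=1000)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Pre-allocate the output columns; each row is filled exactly once,
# in the fold where it lands in the test set.
df["pred"] = np.nan
df["pred_proba_1"] = np.nan

for train_index, test_index in kf.split(X):
    model.fit(X.iloc[train_index], y.iloc[train_index])
    # Map positional test indices back to the dataframe's index labels
    # before writing with .loc (safe even if the index is not 0..n-1).
    rows = df.index[test_index]
    df.loc[rows, "pred"] = model.predict(X.iloc[test_index])
    df.loc[rows, "pred_proba_1"] = model.predict_proba(X.iloc[test_index])[:, 1]

# After the loop, every sample has exactly one out-of-fold prediction.
assert df["pred"].notna().all()
```

If you don't need the fold-by-fold metrics, `sklearn.model_selection.cross_val_predict` (with `method="predict_proba"`) produces the same out-of-fold predictions in one call, which you can then assign to new dataframe columns directly.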