
I would like to run K-fold cross-validation on my data using a classifier. I want to include the prediction (or predicted-probability) columns for each sample directly in the initial dataset/dataframe. Any ideas?

from sklearn.metrics import accuracy_score, roc_auc_score
import pandas as pd
from sklearn.model_selection import KFold

# X (features DataFrame), y (labels) and model (a classifier) are
# assumed to be defined earlier
k = 5
kf = KFold(n_splits=k, random_state=None)

acc_score = []
auroc_score = []

for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
    y_train, y_test = y[train_index], y[test_index]

    model.fit(X_train, y_train)
    pred_values = model.predict(X_test)
    predict_prob = model.predict_proba(X_test)[:, 1]

    auroc = roc_auc_score(y_test, predict_prob)
    acc = accuracy_score(y_test, pred_values)

    auroc_score.append(auroc)
    acc_score.append(acc)

avg_acc_score = sum(acc_score) / k
print('accuracy of each fold - {}'.format(acc_score))
print('Avg accuracy : {}'.format(avg_acc_score))
print('AUROC of each fold - {}'.format(auroc_score))
print('Avg AUROC : {}'.format(sum(auroc_score) / k))

Given this code, how could I go about adding a prediction column, or better yet, the predicted-probability columns, for each sample to the initial dataset?

In 10-fold cross-validation, each example (sample) is used exactly once in a test set and 9 times in a training set. So, after 10-fold cross-validation, the result should be a dataframe containing the predicted class for ALL examples in the dataset. Each example keeps its initial features and its labelled class, and gains the class predicted in the cross-validation fold where that example was in the test set.
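As a quick sanity check of that reasoning (a minimal sketch, assuming plain KFold without shuffling), the test indices across folds partition the row indices, so out-of-fold predictions cover every row exactly once:

import numpy as np
from sklearn.model_selection import KFold

# The test folds partition the data: every row index appears in
# exactly one test set across the 10 folds
kf = KFold(n_splits=10)
test_idx = np.concatenate([test for _, test in kf.split(np.zeros((100, 1)))])
assert np.array_equal(np.sort(test_idx), np.arange(100))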

Simon Provost
  • `predict_proba` returns an array. What do you want the output to look like? The probability for the predicted class, or a list of values in your dataframe column? – artemis Dec 09 '21 at 15:59
  • Thanks for helping. What I wish is that, at the conclusion of the K-fold cross-validation process, I would have a dataframe similar to the initial dataset but with one additional column containing the results of all predictions, and/or two additional columns containing the probabilities for class 0 and class 1 (assuming I am performing binary classification), for each row of the dataset. Is that even possible? I am rather certain I have heard about it. Feel free to continue asking me questions to ensure that we are on the same page; I'll reply promptly. – Simon Provost Dec 09 '21 at 16:15
  • Can you please post an example of what you want in your question? `predict_proba` returns a list of probabilities, i.e. `[0.6, 0.2, 0.1, 0.1]`. Do you want that list in your column? Please post an example of what you are trying to accomplish. – artemis Dec 09 '21 at 16:17
  • @artemis Alright, I made a correction to my initial post. I believe we can disregard `predict_proba` for the moment; the idea is to be able to report the results of the K-fold cross-validation to new columns in the initial dataset. If it is the predicted class for now, that is fine; I will handle reporting two columns with the results of `predict_proba` later on. Please refer to my example and feel free to get back to me for any details; I apologise for the confusion. – Simon Provost Dec 09 '21 at 16:26

2 Answers


You can use cross_val_predict (see the help page); it returns the cross-validated estimate for each sample:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.linear_model import LogisticRegression
import pandas as pd

X, y = make_classification()
df = pd.DataFrame(X, columns=["feature{:02d}".format(i) for i in range(X.shape[1])])
df['label'] = y

# Out-of-fold predictions: each row is predicted by the model
# that did not see it during training
df['pred'] = cross_val_predict(LogisticRegression(), X, y, cv=KFold(5))
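
If you also want the probability columns, cross_val_predict accepts a method argument. A minimal sketch continuing from the snippet above (the proba_0/proba_1 column names are just illustrative):

# Out-of-fold class probabilities, shape (n_samples, n_classes)
proba = cross_val_predict(LogisticRegression(), X, y, cv=KFold(5),
                          method="predict_proba")
df['proba_0'] = proba[:, 0]  # probability of class 0
df['proba_1'] = proba[:, 1]  # probability of class 1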
StupidWolf

You can use the `.loc` accessor to accomplish this; the pattern is `df.loc[index_position, "column_name"] = some_value`

So, here is an edited version of the code you posted (I needed data, and removed the ROC AUC computation since we aren't using probabilities, per your edit):

from sklearn.metrics import accuracy_score
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.datasets import load_breast_cancer
from sklearn.neural_network import MLPClassifier

X,y = load_breast_cancer(return_X_y=True, as_frame=True)
model = MLPClassifier()

k = 5
kf = KFold(n_splits=k, random_state=None)

acc_score = []

# Create a placeholder column for the out-of-fold predictions
X['Prediction'] = 1

# Define what values to use for the model
model_columns = [x for x in X.columns if x != 'Prediction']

for train_index , test_index in kf.split(X):
    X_train , X_test = X.iloc[train_index,:],X.iloc[test_index,:]
    y_train , y_test = y[train_index] , y[test_index]

    model.fit(X_train[model_columns], y_train)
    pred_values = model.predict(X_test[model_columns])

    acc = accuracy_score(pred_values , y_test)
    acc_score.append(acc)

    # Write this fold's predictions back to the test rows
    # (.loc works with test_index here because X has a default RangeIndex)
    X.loc[test_index, 'Prediction'] = pred_values

avg_acc_score = sum(acc_score)/k
print('accuracy of each fold - {}'.format(acc_score))
print('Avg accuracy : {}'.format(avg_acc_score))

# Add label back per question
X['Label'] = y

# Print first 5 rows to show that it works
print(X.head(n=5))

Yields

accuracy of each fold - [0.9210526315789473, 0.9122807017543859, 0.9736842105263158, 0.9649122807017544, 0.8672566371681416]
Avg accuracy : 0.927837292345909
   mean radius  mean texture  ...  Prediction  Label
0        17.99         10.38  ...           0      0
1        20.57         17.77  ...           0      0
2        19.69         21.25  ...           0      0
3        11.42         20.38  ...           1      0
4        20.29         14.34  ...           0      0

[5 rows x 32 columns]

(Obviously the model/values etc are all arbitrary)
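
And if you later want the two probability columns mentioned in the comments, the same `.loc` pattern works. A minimal sketch continuing from the code above, re-running the folds (the Proba_0/Proba_1 column names are just illustrative):

# Placeholder columns for the out-of-fold class probabilities
X['Proba_0'] = 0.0
X['Proba_1'] = 0.0

for train_index, test_index in kf.split(X[model_columns]):
    model.fit(X.iloc[train_index][model_columns], y[train_index])
    # predict_proba returns an (n_samples, 2) array for binary
    # classification; again, .loc works because of the RangeIndex
    X.loc[test_index, ['Proba_0', 'Proba_1']] = model.predict_proba(
        X.iloc[test_index][model_columns])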

artemis