My aim is to determine whether the model is overfitting or underfitting. However, when I compute the learning curves (plots showing how performance improves with more training data), the standard deviation of the cross-validation score is enormous.

What I observe is that when I switch the cross-validation from Leave-One-Out Cross-Validation (LOOCV) to KFold, the results become interpretable and look normal. With LOOCV, the standard deviation is high. I don't understand why.

I chose LOOCV because my dataset is very small. I want each sample to serve as the test set exactly once.
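
With a single test sample per fold, each per-fold accuracy can only be 0 or 1. The following is a minimal sketch of what that does to the standard deviation across folds. The labels and predictions are made up for illustration (they are not my data), and `loo_fold_accuracies` is a hypothetical helper, not a scikit-learn function.

```python
import math
import statistics


def loo_fold_accuracies(y_true, y_pred):
    """Per-fold accuracy when every fold's test set is one sample."""
    return [1.0 if t == p else 0.0 for t, p in zip(y_true, y_pred)]


# Hypothetical labels and leave-one-out predictions, for illustration only.
y_true = [0, 1, 1, 0, 1, 0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1, 0, 0]

fold_acc = loo_fold_accuracies(y_true, y_pred)
mean_acc = statistics.mean(fold_acc)    # fraction of correct predictions
std_acc = statistics.pstdev(fold_acc)   # population std, like np.std's default

# For values that are only 0 or 1, the population std is sqrt(p * (1 - p)),
# where p is the mean. It depends only on p, not on the number of folds.
expected_std = math.sqrt(mean_acc * (1 - mean_acc))

print("per-fold accuracies:", fold_acc)
print("mean: %.2f  std: %.3f  sqrt(p*(1-p)): %.3f" % (mean_acc, std_acc, expected_std))
```

Under these assumptions, the spread is a property of scoring single samples rather than of the model. With KFold, each fold averages over several test samples, so the per-fold scores are not limited to 0 and 1.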

My second question: should I compute the learning curves inside the loop or outside it?

X (the data) and y (the class labels, only 0 and 1) are 1D arrays.

The code:

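import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import LeaveOneOut, learning_curve

loo = LeaveOneOut()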

lr_model = LogisticRegression()

y_pred = []

accuracy_scores = []
f1_scores = []
precision_scores = []
recall_scores = []


# loop through each fold in the LOOCV cross-validation
for train_index, test_index in loo.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    # fit the model on the training data
    lr_model.fit(X_train, y_train)
    
    # use the model to predict the labels for the test data
    y_pred_fold = lr_model.predict(X_test)
    y_pred.extend(y_pred_fold)
    
    # calculate evaluation metrics for this fold
    accuracy = accuracy_score(y_test, y_pred_fold)
    f1 = f1_score(y_test, y_pred_fold, zero_division=0)
    precision = precision_score(y_test, y_pred_fold, zero_division=0)
    recall = recall_score(y_test, y_pred_fold,  zero_division=0)
    
    accuracy_scores.append(accuracy)
    f1_scores.append(f1)
    precision_scores.append(precision)
    recall_scores.append(recall)
    
# calculate the overall evaluation metrics
accuracy = accuracy_score(y, y_pred)
f1 = f1_score(y, y_pred)
precision = precision_score(y, y_pred)
recall = recall_score(y, y_pred)

print("Overall Accuracy: %.2f%%" % (accuracy * 100))  # accuracy_score returns a fraction in [0, 1]
print("Overall F1 Score: %.2f" % f1)
print("Overall Precision: %.2f" % precision)
print("Overall Recall: %.2f" % recall)

# Compute learning curve using LOOCV
train_sizes, train_scores, test_scores = learning_curve(lr_model, X, y, cv=loo, scoring='accuracy', n_jobs=-1)

# Compute mean and standard deviation of training and test scores
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

# Plot learning curve with shaded standard deviation regions
plt.figure()
plt.title('Learning Curve')
plt.xlabel('Training Examples')
plt.ylabel('Accuracy')
plt.grid()
