
I have a dataset of roughly 1600 samples. The whole dataset is made from 22 patients in total; some patients contribute 250 samples, others just 10. The dataset is balanced overall, with around 800 samples per class, but the data of each individual patient is not balanced. I want to perform a binary classification with a 'Leave One Patient Out' cross-validation on that dataset.

I have a patient ID linked to every sample in the entire dataset, and I have split my dataset into 80% train and 20% test. Is there any way I can implement said cross-validation with sklearn, possibly with the LeaveOneOut function?
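From browsing the docs I suspect LeaveOneGroupOut with the patient ID as the group variable might be closer to what I need than LeaveOneOut, but I'm not sure. This is a rough sketch of what I'm imagining (X, y and patient_ids are placeholders for my feature matrix, labels and per-sample patient IDs):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import LeaveOneGroupOut

# X: feature matrix, y: binary labels, patient_ids: patient ID for every sample (placeholders)
logo = LeaveOneGroupOut()
fold_accuracies = []

for train_idx, test_idx in logo.split(X, y, groups=patient_ids):
    # Each split holds out all samples of exactly one patient
    clf = RandomForestClassifier(criterion="gini", random_state=90,
                                 max_depth=8, min_samples_leaf=10)
    clf.fit(X[train_idx], y[train_idx])
    y_pred = clf.predict(X[test_idx])
    fold_accuracies.append(accuracy_score(y[test_idx], y_pred))

print(np.mean(fold_accuracies))   # mean accuracy over the 22 patient folds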

I have tried to do it manually by writing a function that iterates over every patient and splits the training data (80% of the whole data) into another pair of training and test sets. The training data consists of samples from 21 of the 22 patients and the test data of samples from the remaining patient. I tested my classifier (Random Forest) on that held-out patient's data and stored the accuracy. I repeated that process for every patient (22x) and calculated the mean accuracy over all patients. The result is around 55%. Then I tested my classifier on the remaining 20% of my whole dataset and received an accuracy of around 77%.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Train Data (80%) and Test Data (20%)

clf_gini = RandomForestClassifier(criterion="gini",
                                  random_state=90, max_depth=8,
                                  min_samples_leaf=10)
accuracy_list = []      # Creating Empty List for Accuracy of every Cross Validation (22x)

for patient in patients:    # Iterate over every Patient (22x)

    # Function: Use Train Data to Create Train (21/22 Patients) and Test (1/22) Dataset for Cross Validation
    x_train, y_train, x_test, y_test = loocv(train_data, patient, num_feat)

    # Performing training
    clf_gini.fit(x_train, y_train)      # Train Classifier with data of 21 Patients
    y_pred = clf_gini.predict(x_test)   # Apply Classifier on remaining Patient to receive predicted Classes
    accuracy = accuracy_score(y_test, y_pred)   # Calculate Accuracy of Classification on remaining Patient
    accuracy_list.append(accuracy)  # Append Accuracy to list
# After the loop
y_pred_test = clf_gini.predict(test_data[:, 1:-1])  # Apply Classifier (as fitted in the last fold) to Test Dataset of all 22 Patients
mean_accuracy = np.mean(accuracy_list)              # Mean Accuracy over all 22 Cross Validation folds
cal_accuracy(y_pred_test, test_data[:, 0])          # Function that prints Confusion Matrix, Test Accuracy, Report

Is it to be expected that my cross-validation performed poorly compared to the accuracy on the test set? Is the way I approached this feasible, or nonsense? Any input would be very much appreciated!

Samuel
  • I think you are looking for a splitting function like https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeavePGroupsOut.html where your group variable is your patient id and you can avoid in-patient leakage issues – Learning is a mess Jul 13 '23 at 09:59
  • Also there is https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeaveOneGroupOut.html to leave exactly one patient out – Learning is a mess Jul 13 '23 at 10:01
  • Are you sure you want to split the samples from one patient between training and test set? The samples from one patient are expected to be highly correlated, so this approach will over-estimate the accuracy if you are planning to apply the model to new patients in the future. – Arne Jul 13 '23 at 13:48
  • I don't believe this is happening in my case. I am using all the data from one patient and splitting that away from the rest of the data. Meaning I have a training set of every patient but one, and a testing set of just one patient. Then I repeat that process 22 times to have every patient be the testing set. – Samuel Jul 13 '23 at 15:04

0 Answers