I have a dataset of roughly 1600 samples. The whole dataset is made up of 22 patients in total; some patients contribute 250 samples, others just 10. The dataset is balanced overall, with around 800 samples per class, but the samples of each individual patient are not balanced. I want to perform binary classification with 'Leave One Patient Out' cross-validation on this dataset.
Every sample in the dataset is linked to a Patient ID. I have split the dataset into 80% train and 20% test. Is there any way I can implement this cross-validation with sklearn, possibly with the LeaveOneOut function?
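Looking through the sklearn docs, LeaveOneGroupOut seems closer to what I want than LeaveOneOut, since it holds out all samples of one group (one patient) per fold. Here is a minimal, untested sketch of what I have in mind; X, y, and patient_ids are placeholders for my feature matrix, labels, and the per-sample Patient IDs:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import LeaveOneGroupOut

logo = LeaveOneGroupOut()  # one fold per unique patient ID (22 folds)
fold_accuracies = []
for train_idx, test_idx in logo.split(X, y, groups=patient_ids):
    clf = RandomForestClassifier(criterion="gini", random_state=90,
                                 max_depth=8, min_samples_leaf=10)
    clf.fit(X[train_idx], y[train_idx])    # train on the other 21 patients
    y_pred = clf.predict(X[test_idx])      # predict the held-out patient
    fold_accuracies.append(accuracy_score(y[test_idx], y_pred))
print(np.mean(fold_accuracies))            # mean accuracy over the 22 folds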
I have tried to do it manually by writing a function that iterates over every patient and splits the training data (80% of the whole data) into another training and test set: the training set consists of the samples of 21/22 patients, and the test set of the samples of the remaining 1/22 patients. I tested my classifier (a Random Forest) on that one patient's data and stored the accuracy, repeated the process for every patient (22x), and calculated the mean accuracy over all patients. The result is around 55%. I then tested my classifier on the remaining 20% of the whole dataset and got an accuracy of around 77%.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Train data (80%) and test data (20%) were split beforehand
clf_gini = RandomForestClassifier(criterion="gini", random_state=90,
                                  max_depth=8, min_samples_leaf=10)

accuracy_list = []  # accuracy of every cross-validation fold (22x)
for patient in patients:  # iterate over every patient (22x)
    # loocv() is my own helper: splits train_data into a training set
    # (21/22 patients) and a test set (the current patient)
    x_train, y_train, x_test, y_test = loocv(train_data, patient, num_feat)
    clf_gini.fit(x_train, y_train)     # train on the 21 remaining patients
    y_pred = clf_gini.predict(x_test)  # predict the held-out patient
    accuracy = accuracy_score(y_test, y_pred)
    accuracy_list.append(accuracy)

mean_accuracy = np.mean(accuracy_list)  # mean accuracy over all 22 folds

# After the loop, clf_gini is the model fitted on the last fold only;
# apply it to the held-out 20% test set containing all 22 patients
y_pred_test = clf_gini.predict(test_data[:, 1:-1])
cal_accuracy(y_pred_test, test_data[:, 0])  # my helper: prints confusion matrix, test accuracy, report
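If LeaveOneGroupOut is indeed the right tool, I assume my whole manual loop could also collapse into something like this (again untested, same placeholder names as in the sketch above):

from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

# one accuracy score per held-out patient, refitting the model each fold
scores = cross_val_score(clf_gini, X, y, groups=patient_ids,
                         cv=LeaveOneGroupOut())
print(scores.mean())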
Is it to be expected that my cross-validation performs so much worse than the accuracy on the test set? Is the way I approached this feasible or nonsense? Any input would be very much appreciated!