Cross validation for MNIST dataset with pytorch and sklearn

Question

I am new to PyTorch and am trying to implement a feed-forward neural network to classify the MNIST data set. I have run into problems when trying to use cross-validation. My data has the following shapes: x_train: torch.Size([45000, 784]) and y_train: torch.Size([45000])

I tried to use KFold from sklearn.

from sklearn.model_selection import KFold

kfold = KFold(n_splits=10)

Here is the first part of my train method where I'm dividing the data into folds:

for train_index, test_index in kfold.split(x_train, y_train):
        x_train_fold = x_train[train_index]
        x_test_fold = x_test[test_index]
        y_train_fold = y_train[train_index]
        y_test_fold = y_test[test_index]
        print(x_train_fold.shape)
        for epoch in range(epochs):
         ...

The indices for the y_train_fold variable are right: they are simply [ 0 1 2 ... 4497 4498 4499]. But they are not right for x_train_fold, which gets [ 4500 4501 4502 ... 44997 44998 44999]. The same goes for the test folds.

For the first iteration I want the variable x_train_fold to be the first 4500 pictures, in other words to have the shape torch.Size([4500, 784]), but it has the shape torch.Size([40500, 784]).

Any tips on how to get this right?

score 9 · Accepted Answer · answered Nov 23 '19 at 08:32

I think you're confused!

Ignore the second dimension for a moment. When you have 45000 points and use 10-fold cross-validation, what is the size of each fold? 45000/10, i.e. 4500.

This means that each fold will contain 4500 data points; one of those folds will be used for testing and the remaining nine for training, i.e.

For testing: one fold => 4500 data points => size: 4500
For training: remaining folds => 45000-4500 data points => size: 45000-4500=40500

Thus, for the first iteration, the first 4500 data points (indices 0 to 4499) will be used for testing and the rest for training.
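The fold arithmetic above can be checked with a small pure-Python sketch. It assumes unshuffled, contiguous folds, which is how KFold(n_splits=10) lays them out by default; `fold_bounds` is a hypothetical helper written only for illustration and is not part of sklearn.

```python
def fold_bounds(n_samples, n_splits):
    """Return (start, stop) index pairs of contiguous test folds.

    Hypothetical helper: mirrors how an unshuffled K-fold split carves
    the index range into n_splits consecutive blocks.
    """
    base, extra = divmod(n_samples, n_splits)
    bounds = []
    start = 0
    for i in range(n_splits):
        # The first `extra` folds get one additional sample.
        size = base + (1 if i < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


n_samples = 45000
bounds = fold_bounds(n_samples, 10)

# First iteration: the first block is held out for testing.
test_start, test_stop = bounds[0]
test_size = test_stop - test_start
train_size = n_samples - test_size
print(test_size, train_size)  # 4500 40500
```

Each later iteration holds out the next block of 4500 indices, so the training part always has 40500 points.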

Given your data is x_train: torch.Size([45000, 784]) and y_train: torch.Size([45000]), this is how your code should look:

for train_index, test_index in kfold.split(x_train, y_train):  
    print(train_index, test_index)

    x_train_fold = x_train[train_index] 
    y_train_fold = y_train[train_index] 
    x_test_fold = x_train[test_index] 
    y_test_fold = y_train[test_index] 

    print(x_train_fold.shape, y_train_fold.shape) 
    print(x_test_fold.shape, y_test_fold.shape) 
    break 

[ 4500  4501  4502 ... 44997 44998 44999] [   0    1    2 ... 4497 4498 4499]
torch.Size([40500, 784]) torch.Size([40500])
torch.Size([4500, 784]) torch.Size([4500])

So, when you say

I want the variable x_train_fold to be the first 4500 picture... shape torch.Size([4500, 784]).

you're wrong. That size corresponds to x_test_fold. In the first iteration, with 10 folds, x_train_fold will have 40500 points, so its size is supposed to be torch.Size([40500, 784]).

Would be very happy if you just could see through my code below! — Kimmen, Nov 23 '19 at 10:37
@helperFunction the iteration here refers to the KFold iteration, not epoch/iteration in training loop. — kHarshit, Jul 30 '21 at 15:57

score 8 · Answer 2 · answered Nov 23 '19 at 10:34

I think I have it right now, but the code feels a bit messy with 3 nested loops. Is there a simpler way to do it, or is this approach okay?

Here's my code for training with cross-validation:

def train(network, epochs, save_Model = False):
    total_acc = 0
    for fold, (train_index, test_index) in enumerate(kfold.split(x_train, y_train)):
        ### Dividing data into folds
        x_train_fold = x_train[train_index]
        x_test_fold = x_train[test_index]
        y_train_fold = y_train[train_index]
        y_test_fold = y_train[test_index]

        train = torch.utils.data.TensorDataset(x_train_fold, y_train_fold)
        test = torch.utils.data.TensorDataset(x_test_fold, y_test_fold)
        train_loader = torch.utils.data.DataLoader(train, batch_size = batch_size, shuffle = False)
        test_loader = torch.utils.data.DataLoader(test, batch_size = batch_size, shuffle = False)

        for epoch in range(epochs):
            print('\nEpoch {} / {} \nFold number {} / {}'.format(epoch + 1, epochs, fold + 1 , kfold.get_n_splits()))
            correct = 0
            network.train()
            for batch_index, (x_batch, y_batch) in enumerate(train_loader):
                optimizer.zero_grad()
                out = network(x_batch)
                loss = loss_f(out, y_batch)
                loss.backward()
                optimizer.step()
                pred = torch.max(out.data, dim=1)[1]
                correct += (pred == y_batch).sum()
                if (batch_index + 1) % 32 == 0:
                    print('[{}/{} ({:.0f}%)]\tLoss: {:.6f}\t Accuracy:{:.3f}%'.format(
                        (batch_index + 1)*len(x_batch), len(train_loader.dataset),
                        100.*batch_index / len(train_loader), loss.data, float(correct*100) / float(batch_size*(batch_index+1))))
        total_acc += float(correct*100) / float(batch_size*(batch_index+1))
    total_acc = (total_acc / kfold.get_n_splits())
    print('\n\nTotal accuracy cross validation: {:.3f}%'.format(total_acc))

I think it's okay. There are always 2 loops for training, so one more for KFold is fine. You may want to look at [skorch](https://github.com/skorch-dev/skorch) - a sklearn wrapper for pytorch, though I haven't used it. — kHarshit, Nov 23 '19 at 14:14
Nice one! Just to point it out: this is not the accuracy we expect from cross-validation. What we need is the average of the accuracies on test_loader, not on train_loader. (Or am I missing something?) — dak, Dec 02 '20 at 05:10
@kHarshit Shouldn't the weights of the model be re-initialized after each fold? Also, since the optimizer uses the parameters of the model, doesn't it require creating a new optimizer instance for each fold? — Melike, Mar 01 '21 at 22:06
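As a rough sketch of the point raised in the comments, the accuracy reported for each fold could be measured on the held-out loader instead of the training loader. The `evaluate` function below is a hypothetical helper, not part of the answer's code; the toy network and data exist only to make the snippet runnable.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def evaluate(network, loader):
    """Return the accuracy (in percent) of `network` over all samples in `loader`."""
    network.eval()  # switch off dropout / batch-norm updates
    correct = 0
    total = 0
    with torch.no_grad():  # no gradients needed for evaluation
        for x_batch, y_batch in loader:
            pred = network(x_batch).argmax(dim=1)
            correct += (pred == y_batch).sum().item()
            total += y_batch.size(0)
    return 100.0 * correct / total


# Toy check (assumed data): a linear layer set to the identity map,
# so the predicted class is the index of the largest input value.
toy_net = nn.Linear(2, 2)
with torch.no_grad():
    toy_net.weight.copy_(torch.eye(2))
    toy_net.bias.zero_()

toy_x = torch.tensor([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 3.0]])
toy_y = torch.tensor([0, 1, 0, 0])  # the last label is deliberately "wrong"
toy_loader = DataLoader(TensorDataset(toy_x, toy_y), batch_size=2)

toy_acc = evaluate(toy_net, toy_loader)
print(toy_acc)  # 75.0
```

In the training function above, calling something like `evaluate(network, test_loader)` after the last epoch of each fold and averaging those values would give the cross-validation accuracy.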

score 4 · Answer 3 · edited Nov 22 '19 at 15:30

You mixed up the indices.

x_train = x[train_index]
x_test = x[test_index]
y_train = y[train_index]
y_test = y[test_index]

    x_fold = x_train[train_index]
    y_fold = y_train[test_index]

It should be:

x_fold = x_train[train_index]
y_fold = y_train[train_index]
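A minimal sketch of why both tensors must be indexed with the same index array. The toy data is an assumption made for illustration: row i of `x` starts with the value 2*i and carries label i, so any misalignment between samples and labels is easy to detect.

```python
import torch
from sklearn.model_selection import KFold

# Toy data (assumed): row i is [2*i, 2*i + 1] and its label is i.
x = torch.arange(12, dtype=torch.float32).reshape(6, 2)
y = torch.arange(6)

aligned = []
for train_index, test_index in KFold(n_splits=3).split(x):
    # Same index array for samples and labels keeps each pair together.
    x_fold, y_fold = x[train_index], y[train_index]
    x_held, y_held = x[test_index], y[test_index]

    # First column equals 2 * label only if rows and labels still match.
    train_ok = bool((x_fold[:, 0] == 2 * y_fold).all())
    test_ok = bool((x_held[:, 0] == 2 * y_held).all())
    aligned.append(train_ok and test_ok)

print(aligned)  # [True, True, True]
```

Indexing `y` with `test_index` while `x` uses `train_index` would pair samples with labels from a different part of the dataset, and the two tensors would not even have the same length.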

answered Nov 22 '19 at 14:29 by Piotr Rarus · edited Nov 22 '19 at 15:30 by cokeman19

You're right! I've updated the code and the question now, but something is still wrong with my `x_train_fold` – Kimmen Nov 22 '19 at 14:36

score 0 · Answer 4 · answered Apr 25 '21 at 06:06

Although the answers above show how to split the dataset, I would like to address how K-fold cross-validation itself should be implemented. K-fold cross-validation aims to estimate the skill of a machine learning model on unseen data: it uses a limited sample to estimate how the model is expected to perform in general when making predictions on data not used during training. (See the concept and explanation on Wikipedia: https://en.wikipedia.org/wiki/Cross-validation_(statistics)) It is therefore necessary to re-initialize the parameters of your to-be-trained model at the beginning of each fold. Otherwise, by the end of the K folds the model will have been trained on every sample in the dataset, so no fold is truly held out (all samples are effectively training samples).
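One way this could look in code is sketched below, assuming a small synthetic dataset and full-batch training to keep it short. `make_model` is a hypothetical factory, and the layer sizes, learning rate, and epoch count are arbitrary choices, not values from the question. The point is that a new network and a new optimizer are created inside the fold loop, and accuracy is measured on the held-out fold.

```python
import torch
from torch import nn
from sklearn.model_selection import KFold


def make_model():
    """Hypothetical factory: returns a freshly initialized network."""
    return nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 3))


torch.manual_seed(0)
x = torch.randn(120, 20)           # synthetic stand-in for x_train
y = torch.randint(0, 3, (120,))    # synthetic stand-in for y_train

kfold = KFold(n_splits=4)
loss_f = nn.CrossEntropyLoss()
fold_accuracies = []

for fold, (train_index, test_index) in enumerate(kfold.split(x)):
    # Fresh weights and a fresh optimizer for every fold, so nothing
    # learned from a previous fold's training data carries over.
    network = make_model()
    optimizer = torch.optim.SGD(network.parameters(), lr=0.1)

    network.train()
    for epoch in range(5):
        optimizer.zero_grad()
        loss = loss_f(network(x[train_index]), y[train_index])
        loss.backward()
        optimizer.step()

    # Score only on the fold that was held out of training.
    network.eval()
    with torch.no_grad():
        pred = network(x[test_index]).argmax(dim=1)
        acc = (pred == y[test_index]).float().mean().item() * 100
    fold_accuracies.append(acc)

mean_acc = sum(fold_accuracies) / len(fold_accuracies)
print('Mean held-out accuracy: {:.3f}%'.format(mean_acc))
```

With random labels the accuracy here is meaningless; with real data, `mean_acc` is the cross-validation estimate.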
