I am trying to implement MNIST using PyTorch Lightning, and I wanted to use k-fold
cross-validation. The problem is that I am getting NaN values from the loss function
(for at least one fold). In the output below, the loss became NaN from the third fold onward:
Epoch 19: 100%|█████████████████████████████████| 110/110 [00:03<00:00, 29.24it/s, loss=0.963, v_num=287]
Testing: 100%|███████████████████████████████████████████████████████████| 28/28 [00:00<00:00, 39.94it/s]
Epoch 19: 100%|█████████████████████████████████| 110/110 [00:04<00:00, 25.69it/s, loss=0.825, v_num=288]
Testing: 100%|███████████████████████████████████████████████████████████| 28/28 [00:00<00:00, 41.19it/s]
Epoch 19: 100%|███████████████████████████████████| 110/110 [00:03<00:00, 30.19it/s, loss=nan, v_num=289]
Testing: 100%|███████████████████████████████████████████████████████████| 28/28 [00:00<00:00, 42.15it/s]
Or a very large loss value (I terminated the run before completing all epochs):
Epoch 0: 44%|█████████████▉ | 48/110 [00:02<00:02, 22.87it/s, loss=2.08e+23, v_num=295]
The code I have used for data preparation, k-fold splitting, and the trainer is given below:
import os
from torch.utils.data import ConcatDataset
from torchvision import transforms
from torchvision.datasets import MNIST

def prepare_data():
    # normalize with the standard MNIST mean/std
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.1307,), (0.3081,))])
    mnist_train = MNIST(os.getcwd(), train=True, download=True, transform=transform)
    mnist_test = MNIST(os.getcwd(), train=False, download=True, transform=transform)
    # merge train and test so k-fold can split all 70,000 samples
    dataset = ConcatDataset([mnist_train, mnist_test])
    return dataset
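As a quick sanity check, KFold only needs the dataset length, so the 70,000 concatenated samples give 56,000 train / 14,000 validation indices per fold (a standalone snippet, separate from the training loop below):

from sklearn.model_selection import KFold

dataset = prepare_data()
kfold = KFold(n_splits=5, shuffle=True)
for fold, (train_idx, val_idx) in enumerate(kfold.split(dataset)):
    print(f'fold {fold}: {len(train_idx)} train / {len(val_idx)} val')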
import torch
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint
from sklearn.model_selection import KFold

k_folds = 5
epochs = 20
kfold = KFold(n_splits=k_folds, shuffle=True)
dataset = prepare_data()
model = LightningMNIST(lr_rate=0.01)

for fold, (train_idx, val_idx) in enumerate(kfold.split(dataset)):
    train_subsampler = torch.utils.data.SubsetRandomSampler(train_idx)
    val_subsampler = torch.utils.data.SubsetRandomSampler(val_idx)
    train_loader = torch.utils.data.DataLoader(dataset, num_workers=8, batch_size=512, sampler=train_subsampler)
    val_loader = torch.utils.data.DataLoader(dataset, num_workers=8, batch_size=512, sampler=val_subsampler)
    model.apply(reset_weights)  # reset model weights for every fold
    early_stopping = EarlyStopping('train_loss', mode='min', patience=5)
    # checkpoint filenames include the epoch and train_loss; model_path is defined elsewhere
    model_checkpoint = ModelCheckpoint(dirpath=model_path, filename='mnist_{epoch}-{train_loss:.2f}',
                                       monitor='train_loss', mode='min', save_top_k=3)
    trainer = pl.Trainer(max_epochs=epochs, profiler=False,
                         callbacks=[early_stopping, model_checkpoint],
                         default_root_dir=model_path)
    trainer.fit(model, train_dataloader=train_loader)
    trainer.test(test_dataloaders=val_loader, ckpt_path=None)
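For completeness, reset_weights is not shown above; a minimal sketch, assuming it follows the common pattern of re-initializing every layer that exposes reset_parameters:

def reset_weights(m):
    # re-initialize any layer that knows how to reset itself (Linear, Conv2d, ...)
    if hasattr(m, 'reset_parameters'):
        m.reset_parameters()

Note that each fold also builds a fresh Trainer, which calls configure_optimizers again, so optimizer state does not carry over between folds.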
The training step is given below:
def training_step(self, train_batch, batch_idx):
    x, y = train_batch
    logits = self.forward(x)
    # error_loss is defined elsewhere in the LightningModule
    loss = self.error_loss(logits.squeeze(-1), y.float())
    self.log('train_loss', loss)
    return {'loss': loss}
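Since error_loss is not defined here: the squeeze(-1) and y.float() suggest a regression-style loss on the raw logits, which can diverge once the logits grow large. For comparison, the standard 10-class MNIST setup uses cross-entropy on integer labels; a minimal sketch (assuming forward returns one logit per class):

import torch.nn.functional as F

def training_step(self, train_batch, batch_idx):
    x, y = train_batch
    logits = self.forward(x)  # shape (batch_size, 10)
    # cross_entropy applies log-softmax internally, which is numerically stable;
    # y stays as integer class indices, not floats
    loss = F.cross_entropy(logits, y)
    self.log('train_loss', loss)
    return {'loss': loss}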
I assume I am doing something wrong either in the k-fold data preparation or in the training step; otherwise, getting NaN or a very large loss is not expected for such a simple problem and such a simple model.
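One way to narrow it down is to make the Trainer clip exploding gradients and abort on NaN; a minimal sketch (gradient_clip_val is a standard Trainer argument; terminate_on_nan existed in older Lightning versions, newer ones use detect_anomaly instead):

trainer = pl.Trainer(max_epochs=epochs,
                     gradient_clip_val=0.5,   # clip gradient norm to tame loss spikes
                     terminate_on_nan=True,   # stop as soon as the loss turns NaN
                     callbacks=[early_stopping, model_checkpoint],
                     default_root_dir=model_path)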
I have gone through several posts like this, this, and that. Some of them suggested that this could happen because the dataset contains NaN values (but I think MNIST does not, since it is downloaded directly from the module) or because of the learning rate, but mine is 0.01 (neither too big nor too small). Moreover, I believe this post is not a duplicate, because here I am trying to use k-fold, even though the error looks the same.
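To rule out the data, a quick standalone check confirms the normalized tensors contain no NaN (a sketch over the concatenated dataset from prepare_data):

import torch
from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=1024, num_workers=4)
for x, y in loader:
    assert not torch.isnan(x).any(), 'found NaN in the input images'
print('no NaN values in the dataset')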
Any suggestions?