
I am trying to implement an MNIST classifier using PyTorch Lightning, and I want to use k-fold cross-validation.

The problem is that I am getting NaN values from the loss function for at least one fold. In the output below, the third fold produces a NaN loss:

Epoch 19: 100%|█████████████████████████████████| 110/110 [00:03<00:00, 29.24it/s, loss=0.963, v_num=287]
Testing: 100%|███████████████████████████████████████████████████████████| 28/28 [00:00<00:00, 39.94it/s]

Epoch 19: 100%|█████████████████████████████████| 110/110 [00:04<00:00, 25.69it/s, loss=0.825, v_num=288]
Testing: 100%|███████████████████████████████████████████████████████████| 28/28 [00:00<00:00, 41.19it/s]

Epoch 19: 100%|███████████████████████████████████| 110/110 [00:03<00:00, 30.19it/s, loss=nan, v_num=289]
Testing: 100%|███████████████████████████████████████████████████████████| 28/28 [00:00<00:00, 42.15it/s]

Or I get a very large loss value (here I terminated the run before completing all epochs):

Epoch 0:  44%|█████████████▉                  | 48/110 [00:02<00:02, 22.87it/s, loss=2.08e+23, v_num=295]
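To locate where the first NaN appears, one option is PyTorch's built-in anomaly detection (a minimal sketch; debug-only, as it slows training considerably):

import torch

# Make autograd raise an error at the first backward op that produces
# NaN/Inf, identifying the operation that created it (debug only).
torch.autograd.set_detect_anomaly(True)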

The code I used for data preparation, the k-fold split, and the trainer is given below:

import os
import torch
from torch.utils.data import ConcatDataset
from torchvision import transforms
from torchvision.datasets import MNIST

def prepare_data():
  # Normalize with the standard MNIST mean/std, then merge the official
  # train and test splits so k-fold can repartition the full dataset.
  transform = transforms.Compose([transforms.ToTensor(),
                                  transforms.Normalize((0.1307,), (0.3081,))])
  mnist_train = MNIST(os.getcwd(), train=True, download=True, transform=transform)
  mnist_test = MNIST(os.getcwd(), train=False, download=True, transform=transform)
  dataset = ConcatDataset([mnist_train, mnist_test])
  return dataset


from sklearn.model_selection import KFold
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint

k_folds = 5
epochs = 20

kfold = KFold(n_splits=k_folds, shuffle=True)

# model_path and LightningMNIST are defined elsewhere in the script
dataset = prepare_data()
model = LightningMNIST(lr_rate=0.01)

for fold, (train_idx, val_idx) in enumerate(kfold.split(dataset)):
  train_subsampler = torch.utils.data.SubsetRandomSampler(train_idx)
  val_subsampler = torch.utils.data.SubsetRandomSampler(val_idx)

  train_loader = torch.utils.data.DataLoader(dataset, num_workers=8, batch_size=512, sampler=train_subsampler)
  val_loader = torch.utils.data.DataLoader(dataset, num_workers=8, batch_size=512, sampler=val_subsampler)
  model.apply(reset_weights)  # reset model weights for every fold
  early_stopping = EarlyStopping('train_loss', mode='min', patience=5)
  model_checkpoint = ModelCheckpoint(dirpath=model_path+'mnist_{epoch}-{train_loss:.2f}',
                                     monitor='train_loss', mode='min', save_top_k=3)
  # early_stopping must be passed to the Trainer to have any effect
  trainer = pl.Trainer(max_epochs=epochs, profiler=False,
                       callbacks=[model_checkpoint, early_stopping],
                       default_root_dir=model_path)
  trainer.fit(model, train_dataloader=train_loader)
  trainer.test(test_dataloaders=val_loader, ckpt_path=None)
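The reset_weights helper is not shown above; a typical implementation follows the usual per-module pattern (a sketch, assuming every learnable layer exposes reset_parameters, as Linear and Conv2d do):

def reset_weights(m):
  # Reinitialize any submodule that knows how to reset itself
  # (Linear, Conv2d, BatchNorm, ... all provide reset_parameters).
  if hasattr(m, 'reset_parameters'):
    m.reset_parameters()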

The training step is given below:

  def training_step(self, train_batch, batch_idx):
    x, y = train_batch
    logits = self.forward(x)
    # error_loss is defined elsewhere in the LightningModule
    loss = self.error_loss(logits.squeeze(-1), y.float())
    self.log('train_loss', loss)
    return {'loss': loss}
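Given that the loss sometimes explodes (2.08e+23) before turning NaN, gradient clipping might be a relevant mitigation; Lightning exposes it as a Trainer flag (a sketch, not confirmed to fix this particular case):

trainer = pl.Trainer(max_epochs=epochs,
                     gradient_clip_val=1.0,  # clip gradient norm to 1.0 before each step
                     callbacks=[model_checkpoint, early_stopping],
                     default_root_dir=model_path)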

I assume I am doing something wrong in the k-fold data preparation or in the training step; otherwise, getting NaN or very large loss values is not expected for such a simple problem and such a simple model.

I have gone through several posts like this, this, and that. Some of them suggest that it could happen because the dataset contains NaN values (but I think MNIST does not contain NaN, since I download it directly through the module), or because the learning rate is too big or too small (mine is 0.01, neither extreme). Moreover, I believe this post is not a duplicate (because here I am trying to use k-fold, though the error seems the same).
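To rule out NaNs in the input data, a quick sanity check over the concatenated dataset could look like this (sketch):

loader = torch.utils.data.DataLoader(dataset, batch_size=4096)
# One pass over all 70k images: assert nothing became NaN after Normalize.
assert not any(torch.isnan(x).any().item() for x, _ in loader), "dataset contains NaN"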

Any suggestions?

  • Can you provide a [minimal, reproducible example](https://stackoverflow.com/help/minimal-reproducible-example)? – aretor Jan 19 '22 at 08:08
  • I would also try the following: (a) set a seed and a conditional debug to stop where you get the `NaN` value, then inspect your structures starting from the loss inputs. (b) try to run your script without K-fold, to understand if it is responsible for the error. – aretor Jan 19 '22 at 08:08
  • @aretor thanks for the suggestion. Using seed is a good idea. Would you mind checking the full code in here (https://filebin.net/m2masi1r0mhhaeh1)? – Opps_0 Jan 19 '22 at 12:18
  • @aretor I tried several times without k-fold and the model was working fine (no error or no NaN) – Opps_0 Jan 19 '22 at 12:31
  • I tested the code on Google Colab and the train loss does not report `NaN`, maybe it is related to your environment? Perhaps try to use one of PyTorch's `torchvision` models. – aretor Jan 20 '22 at 13:52
  • @aretor thank you for your time and I am glad that you tested the file. However, is my `k-fold` implementation logic fine, or is it implemented wrongly? – Opps_0 Jan 20 '22 at 18:45
  • 1
    As far as I am concerned, I don't see any evident error. Thus, I can only suggest you simplify your code until you find which combination is raising the error. Check [this](https://pytorch.org/docs/stable/notes/randomness.html) for better reproducibility. – aretor Jan 21 '22 at 09:45
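Following aretor's suggestions above, a minimal reproducibility setup before the k-fold loop could look like this (a sketch; the seed value is arbitrary):

import pytorch_lightning as pl

# Seed Python, NumPy and torch (including dataloader workers) so a
# failing fold can be reproduced deterministically.
pl.seed_everything(42, workers=True)
# Fix the fold assignment as well.
kfold = KFold(n_splits=k_folds, shuffle=True, random_state=42)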

0 Answers