
I am trying to use TensorBoard with PyTorch to plot loss and accuracy graphs, using a SummaryWriter.

However, when I run my training loop, I get an error at the end of the first epoch.

The error is shown in the following screenshot: [image not reproduced]

For reference, here is the Python code that uses the summary writer:

# Imports assumed by this snippet (not shown in the original post)
import torch
from torch import optim
from torch.optim import lr_scheduler
from torch.utils.tensorboard import SummaryWriter

loss_fn = torch.nn.BCELoss()
lr = 1e-3
optimizer = optim.SGD(model.parameters(), lr=lr)

scheduler = lr_scheduler.StepLR(optimizer, 8, gamma=0.1, last_epoch=-1)
n_epochs = 20
log_interval = 1

writer = SummaryWriter()

fit(train_loader, test_loader, model, loss_fn, optimizer, scheduler, n_epochs, cuda, log_interval, [metrics.Binary_accuracy()], writer)
writer.close()

def fit(train_loader, val_loader, model, loss_fn, optimizer, scheduler, n_epochs, cuda, log_interval, metrics, writer,
        start_epoch=0, message=None) -> None:


    for epoch in range(0, start_epoch):
        scheduler.step()

    for epoch in range(start_epoch, n_epochs):
        scheduler.step()

        # Train stage
        train_loss, metrics = train_epoch(train_loader, model, loss_fn, optimizer, cuda, log_interval, metrics, writer)
        #log_data['train_loss'].append(train_loss)

        writer.add_scalar('loss/train',train_loss,epoch)

        message = 'Epoch: {}/{}. Train set: Average loss: {:.4f}'.format(epoch + 1, n_epochs, train_loss)
        for metric in metrics:
            message += '\t{}: {}'.format(metric.name(), metric.value())
            writer.add_scalar(f'{metric.name()}/train', metric.value(), epoch)

        val_loss, metrics = test_epoch(val_loader, model, loss_fn, cuda, metrics, log_interval)
        val_loss /= len(val_loader)
        writer.add_scalar('loss/test', val_loss, epoch)

        message += '\nEpoch: {}/{}. Test set: Average loss: {:.4f}'.format(epoch + 1, n_epochs,
                                                                                 val_loss)
        for metric in metrics:
            message += '\t{}: {}'.format(metric.name(), metric.value())
            writer.add_scalar(f'{metric.name()}/test', metric.value(), epoch)

        print(message)
        writer.flush()

I would also like to plot a confusion matrix in TensorBoard. Any pointers on how to do that would be welcome.
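One possible approach to the confusion matrix, sketched under a few assumptions: the labels and predictions are already available as flat sequences of 0/1 values, matplotlib is installed, and the helper names `binary_confusion_matrix` and `confusion_matrix_figure` are hypothetical (they are not part of the code above). The idea is to count the four outcomes, draw them as a matplotlib figure, and hand that figure to the writer's `add_figure` method.

```python
def binary_confusion_matrix(y_true, y_pred):
    """Count outcomes for 0/1 labels; returns [[TN, FP], [FN, TP]]."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        # Row index is the true class, column index is the predicted class.
        matrix[int(t)][int(p)] += 1
    return matrix


def confusion_matrix_figure(matrix, class_names=("0", "1")):
    """Draw the matrix as a matplotlib figure (hypothetical helper)."""
    # Imported lazily so the counting helper works without matplotlib.
    import matplotlib
    matplotlib.use("Agg")  # headless backend, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(4, 4))
    ax.imshow(matrix, cmap="Blues")
    ax.set_xticks(range(len(class_names)))
    ax.set_xticklabels(class_names)
    ax.set_yticks(range(len(class_names)))
    ax.set_yticklabels(class_names)
    ax.set_xlabel("Predicted")
    ax.set_ylabel("True")
    # Write each count into its cell.
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            ax.text(j, i, str(count), ha="center", va="center")
    return fig
```

Inside the epoch loop, something like `writer.add_figure('confusion_matrix/test', confusion_matrix_figure(matrix), epoch)` should then show the plot under the Images tab, assuming `writer` is a `torch.utils.tensorboard.SummaryWriter`.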

Any help is most appreciated! Thank you in advance

aarya
  • Since the error happened in `load_sequence`, I'm wondering why you chose to show us THIS code. The implication here is that `load_sequence` is opening image files and never closing them, just as the error says. – Tim Roberts Aug 14 '21 at 17:30
  • You can try to use `Process Explorer` to see which files are open. – MegaIng Aug 14 '21 at 17:37
  • Sounds like an OS limitation, have you tried to change the max number of open files? https://stackoverflow.com/questions/6774724/why-python-has-limit-for-count-of-file-handles/28212496 – user107511 Aug 14 '21 at 22:08
  • Thanks for your comments. I solved the error by appending all the losses to a list and writing the losses and accuracy to the summary writer once, after the entire training run, instead of after every epoch. That avoided the too-many-open-files error. – aarya Aug 16 '21 at 12:41
  • See: [Too many open files under RustBoard (EMFILE)](https://github.com/tensorflow/tensorboard/issues/4955). – kenorb Dec 20 '21 at 00:41
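
The workaround described in aarya's comment above (collect values during training, write them once at the end) could be sketched roughly as follows. `ScalarHistory` is a hypothetical helper, not part of PyTorch or TensorBoard, and the only thing it assumes about the writer is that it has an `add_scalar(tag, value, step)` method, as `SummaryWriter` does. Note that this only sidesteps the error. If `load_sequence` really is leaving files open, as the first comment suggests, that leak would still need fixing.

```python
class ScalarHistory:
    """Buffers scalar values in memory and writes them out in one pass."""

    def __init__(self):
        # Each entry is (tag, value, step), kept in insertion order.
        self._points = []

    def add(self, tag, value, step):
        # float() also accepts zero-dimensional tensors and NumPy scalars.
        self._points.append((tag, float(value), step))

    def write_to(self, writer):
        # `writer` is anything exposing add_scalar(tag, value, step).
        for tag, value, step in self._points:
            writer.add_scalar(tag, value, step)
```

In `fit`, each `writer.add_scalar(...)` call would become `history.add(...)`, and after the last epoch `history.write_to(writer)` followed by `writer.close()` would produce the same curves.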

0 Answers