
I'm currently training some neural network models and I've found that, for some reason, the model will sometimes fail within the first ~200 iterations due to a runtime error, despite there apparently being memory available. The error is:

RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 10.76 GiB total capacity; 1.79 GiB already allocated; 3.44 MiB free; 9.76 GiB reserved in total by PyTorch)

This suggests that only ~1.8 GiB of GPU memory is actually allocated, while 9.76 GiB should be available.

I have found that when I hit on a good seed (just by random searching) and the model gets past the first few hundred iterations, it will generally run fine afterwards. It seems as though the model has less memory available very early in training, but I don't know how to solve this.

7koFnMiP
  • Try to monitor the GPU allocation while the training is running using e.g. `watch -n 0.5 nvidia-smi`. You will likely see the GPU memory usage growing beyond your limit. I also recommend calling `torch.cuda.reset_peak_memory_stats()` before/after training (see the sketch after these comments). If you want to dig deeper, this might be relevant: https://github.com/pytorch/pytorch/issues/35901 – adeelh Aug 03 '21 at 11:30
  • Are you fine-tuning a model? Try reducing the number of layers you're training to see if a particular part of your architecture is causing problems. Training from scratch? Try increasing the dropout rate. I doubt these specific recommendations will solve your problem directly, but you may gain more insight into what is contributing to the increasing memory footprint. Just an idea – VanBantam Aug 06 '21 at 01:20
  • For me, the above error often demands that the batch-size be reduced (especially for computer-vision or other large matrices). – Arnab De Aug 06 '21 at 16:40
  • I don't think it's a batch size problem, as it isn't really a memory issue insofar as the model trains fine after the first few iterations. – 7koFnMiP Aug 06 '21 at 17:09
  • Where are you running this code? Locally or on a cloud service? – pygeek Aug 07 '21 at 19:46
  • On a node on a GPU cluster – 7koFnMiP Aug 08 '21 at 09:24
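
For reference, here is a minimal sketch of the per-iteration monitoring suggested by adeelh above; the model, loader, optimizer, and loss below are toy placeholders rather than anything from the question:

```python
import torch
import torch.nn as nn

# Toy stand-ins so the sketch runs on its own; swap in your real model/data.
model = nn.Linear(1024, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
loader = [(torch.randn(32, 1024), torch.randint(0, 10, (32,))) for _ in range(5)]

for i, (inputs, targets) in enumerate(loader):
    # Reset the peak counter so max_memory_allocated() reflects this iteration only.
    torch.cuda.reset_peak_memory_stats()

    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()

    # Bytes held by live tensors vs. bytes reserved by the caching allocator.
    allocated = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    peak = torch.cuda.max_memory_allocated() / 2**20
    print(f"iter {i}: allocated={allocated:.1f} MiB  reserved={reserved:.1f} MiB  peak={peak:.1f} MiB")
```

If `reserved` or `peak` climbs steadily over the first few hundred iterations, something is accumulating between iterations (e.g. losses or outputs kept alive in a Python list).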

1 Answer


It's worth noting this part of your error: `9.76 GiB reserved in total by PyTorch`, meaning that this memory is not necessarily available. I have had a similar issue before, and I would try to empty the cache using `torch.cuda.empty_cache()`. Note that this only releases cached blocks that are no longer referenced by any tensor, so you may first need to delete stale tensors (and run the garbage collector) for it to have any effect. Afterwards, use the `nvidia-smi` CLI to check the result. A common cause of maxing out on memory is the batch size; I tend to use this method to calculate a reasonable batch size: https://stackoverflow.com/a/59923608/10148950.
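
As a rough illustration of that clean-up (not the asker's code; the tensor here just stands in for outputs you no longer need):

```python
import gc
import torch

# Stand-in for activations/outputs that are no longer needed.
leftover = torch.randn(4096, 4096, device="cuda")

# empty_cache() can only return blocks that no live tensor references,
# so drop the reference and collect garbage first.
del leftover
gc.collect()
torch.cuda.empty_cache()

# These counters should now reflect the freed memory; compare with nvidia-smi.
print(f"allocated: {torch.cuda.memory_allocated() / 2**20:.1f} MiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 2**20:.1f} MiB")
```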

There are also ways to use the PyTorch library itself to investigate memory usage, as per this answer: https://stackoverflow.com/a/58216793/10148950
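
For instance, a short sketch of that kind of inspection (assuming GPU 0, as in the error message):

```python
import torch

device = torch.device("cuda:0")

# Counters maintained by PyTorch's caching allocator, in MiB.
print(f"allocated:      {torch.cuda.memory_allocated(device) / 2**20:.1f} MiB")
print(f"reserved:       {torch.cuda.memory_reserved(device) / 2**20:.1f} MiB")
print(f"peak allocated: {torch.cuda.max_memory_allocated(device) / 2**20:.1f} MiB")

# Detailed breakdown: active/inactive blocks, fragmentation, number of OOMs, etc.
print(torch.cuda.memory_summary(device, abbreviated=True))
```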

BCoxford