To fine-tune the StarCoder LLM on my GCP instance, I have set up 4 NVIDIA Tesla T4 GPUs (16 GB each). I installed nvitop to monitor GPU usage while fine-tuning.
I have also installed the CUDA toolkit on the VM (verified with nvcc --version).
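For context, a quick check like the one below (assuming PyTorch, which the traceback at the end comes from) confirms that all four devices are at least visible to the process:

```python
import torch

# Sanity check: does the training process see all four T4s?
# If this prints 1, the process is restricted to a single GPU
# (e.g. by CUDA_VISIBLE_DEVICES) regardless of framework settings.
print(torch.cuda.is_available())   # should be True
print(torch.cuda.device_count())   # should be 4
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```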
The problem is that all the computation currently happens on a single GPU (GPU0), so as soon as the model needs more than 16 GB it raises a CUDA OutOfMemoryError.
How do I ensure the work is spread across all 4 GPUs? Is there any additional configuration needed at the VM level? I'm new to this, so any guidance is appreciated. Thanks in advance.

The exact error:
```
OutOfMemoryError: CUDA out of memory. Tried to allocate 144.00 MiB (GPU 0; 14.62 GiB total capacity; 13.16 GiB already allocated; 103.38 MiB free; 13.96 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
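From what I've read, the transformers `device_map="auto"` option (backed by the accelerate library) is supposed to shard the model's layers across all visible GPUs instead of placing everything on GPU0, so something along these lines might be what's missing; this is only a sketch based on the docs, not code I have working:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigcode/starcoder"  # assuming the Hugging Face checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" asks accelerate to split the model's layers across
# the 4 T4s (model parallelism) rather than loading it all onto GPU0.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
)
```

Is this the right direction, or does multi-GPU use also require something at the VM/driver level?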