
To fine-tune the StarCoder LLM on my GCP instance, I have set up 4 NVIDIA Tesla T4 GPUs (16 GB each).

I installed nvitop to monitor GPU usage while fine-tuning.

I have also installed the CUDA toolkit on the VM (verified with nvcc --version).

The problem is that all the computation currently happens on a single GPU (GPU0), which is why the model throws a CUDA OutOfMemory error as soon as it needs more than 16 GB.

How do I ensure the load is balanced across all 4 GPUs? Is there any additional configuration that needs to be done at the VM level?

I'm new to this; any assistance is appreciated. Thanks in advance.

```
OutOfMemoryError: CUDA out of memory. Tried to allocate 144.00 MiB (GPU 0; 14.62 GiB total capacity; 13.16 GiB already allocated;
103.38 MiB free; 13.96 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb
to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
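For the `max_split_size_mb` hint at the end of the traceback: PyTorch reads that setting from the `PYTORCH_CUDA_ALLOC_CONF` environment variable named in the same message. A minimal sketch of setting it (the 128 MiB value is an illustrative assumption, not a tuned recommendation):

```python
# Sketch: pass the allocator option from the traceback via an environment
# variable. It must be set before the first CUDA allocation, so set it
# before importing torch in the training script (or in the launching shell).
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # 128 MiB is illustrative

import torch  # the CUDA caching allocator reads the variable on first use
```

Note that this only mitigates fragmentation on GPU0; it does not spread the model across the other GPUs.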
  • In order to load balance, you need to reduce the `batch_size`; follow this [Blog](https://medium.com/@snk.nitin/how-to-solve-cuda-out-of-memory-error-850bb247cfb2) by Nitin Kishore. For more information you can also refer to [how-to-avoid-cuda-out-of-memory-in-pytorch](https://stackoverflow.com/questions/59129812/how-to-avoid-cuda-out-of-memory-in-pytorch) and [stack2](https://stackoverflow.com/questions/54374935/). – Hemanth Kumar May 28 '23 at 11:08
  • The current batch_size is 1. I tried gc.collect() and torch.cuda.empty_cache(); neither of them worked. I can see the model consuming all 16 GB of one GPU and then, as expected, running out of memory. How do I allow the model to run on the other available GPUs when the current GPU's memory is fully used (see the sketch after these comments)? – Aadesh May 28 '23 at 15:12
  • Can you check the official [doc1](https://www.tensorflow.org/guide/gpu#using_multiple_gpus) and [Doc2](https://www.tensorflow.org/guide/distributed_training)? They might give you some insight on this. – Hemanth Kumar May 29 '23 at 04:40
  • Let me know whether the shared info was helpful. I am happy to assist if you have any further queries. – Hemanth Kumar May 31 '23 at 05:33
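A minimal sketch of the multi-GPU direction raised in the comments above, assuming the Hugging Face `transformers` and `accelerate` packages that StarCoder is usually loaded with: instead of loading the whole model onto GPU0, shard its layers across all visible GPUs at load time. The model ID and dtype below are assumptions, not details from the question.

```python
# Sketch: shard the model across all visible GPUs instead of loading it onto
# GPU0 alone. Requires the transformers and accelerate packages;
# "bigcode/starcoder" is the assumed model ID (it is a gated model on the Hub).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigcode/starcoder"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # let accelerate split layers across GPU0-GPU3
    torch_dtype=torch.float16,  # half precision roughly halves weight memory
)

# nvitop should now show the weights spread over all four T4s
print(model.hf_device_map)
```

This is model parallelism (each GPU holds a slice of the layers), not load balancing of independent replicas, so a single training process can use the combined 64 GB of T4 memory; typically no additional VM-level configuration is needed once all four GPUs show up in nvitop.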

0 Answers