
Here is my hardware setup:

!nvidia-smi
Tue Nov 15 08:49:04 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.60.02    Driver Version: 510.60.02    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 4000     On   | 00000000:81:00.0 Off |                  N/A |
| 44%   32C    P8     9W / 125W |    159MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2063      G                                      63MiB |
|    0   N/A  N/A   1849271      C                                      91MiB |
+-----------------------------------------------------------------------------+

!free -h
              total        used        free      shared  buff/cache   available
Mem:            64G        677M         31G         10M         32G         63G
Swap:            0B          0B          0B
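
For reference, this is roughly how I cross-check what PyTorch itself reports for the card (just a small sanity-check sketch of mine, not part of main.py):

import torch

# Sanity check: what PyTorch sees on GPU 0, reported in GiB.
props = torch.cuda.get_device_properties(0)
print(f"Device:    {props.name}")
print(f"Total:     {props.total_memory / 1024**3:.2f} GiB")
print(f"Allocated: {torch.cuda.memory_allocated(0) / 1024**3:.2f} GiB")
print(f"Reserved:  {torch.cuda.memory_reserved(0) / 1024**3:.2f} GiB")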

As you can see, I have plenty of CUDA memory and hardly any of it is in use. This is the error I am getting:

Traceback (most recent call last):
  File "main.py", line 834, in <module>
    raise err
  File "main.py", line 816, in <module>
    trainer.fit(model, data)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 771, in fit
    self._call_and_handle_interrupt(
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 722, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 812, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1218, in _run
    self.strategy.setup(self)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 162, in setup
    self.model_to_device()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 324, in model_to_device
    self.model.to(self.root_device)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/core/mixins/device_dtype_mixin.py", line 121, in to
    return super().to(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 927, in to
    return self._apply(convert)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  [Previous line repeated 4 more times]
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 602, in _apply
    param_applied = fn(param)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 925, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 7.80 GiB total capacity; 6.70 GiB already allocated; 12.44 MiB free; 6.80 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
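
The error message itself suggests setting max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF. As far as I understand the docs, that would be set before CUDA is first initialized, roughly like this (the 128 is just a value I picked; I have not confirmed it helps):

import os

# Must be set before the first CUDA allocation so the caching allocator picks it up.
# Shell equivalent: PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python main.py
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after setting the variable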

Using the following code reduced the "Tried to allocate" amount from 146 MiB to 20 MiB:

import torch
from GPUtil import showUtilization as gpu_usage
from numba import cuda

def free_gpu_cache():
    print("Initial GPU Usage")
    gpu_usage()                             

    # Release cached blocks held by PyTorch's caching allocator
    torch.cuda.empty_cache()

    # Tear down and re-create the CUDA context on device 0 via numba
    cuda.select_device(0)
    cuda.close()
    cuda.select_device(0)

    print("GPU Usage after emptying the cache")
    gpu_usage()

free_gpu_cache()
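
For comparison, this is the torch-only way I know of to inspect and release the cache (no GPUtil/numba; the function name below is just my own sketch). I have not verified that it behaves any differently:

import torch

def report_and_empty_cache():
    # empty_cache() returns cached, unused blocks to the driver; it cannot free
    # tensors that are still referenced somewhere in the program.
    torch.cuda.empty_cache()
    # Detailed breakdown of the caching allocator's state on GPU 0.
    print(torch.cuda.memory_summary(device=0, abbreviated=True))

report_and_empty_cache()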

Where am I going wrong?

  • Does this answer your question? [How to fix this strange error: "RuntimeError: CUDA error: out of memory"](https://stackoverflow.com/questions/54374935/how-to-fix-this-strange-error-runtimeerror-cuda-error-out-of-memory) – YesThatIsMyName Nov 15 '22 at 09:27
  • @YesThatIsMyName No it doesn't because I have plenty of available CUDA memory so it shouldn't run out. – Onur-Andros Ozbek Nov 15 '22 at 09:35
  • And you tried all of the solutions/suggestions in the above question? Also have a look at the PyTorch documentation: https://pytorch.org/docs/stable/notes/faq.html. – YesThatIsMyName Nov 15 '22 at 09:51
  • *"No it doesn't because I have plenty of available CUDA memory so it shouldn't run out"*, yes... you have plenty of CUDA memory just until you run out of memory. Out of memory error are generally either caused by the data/model being too big or a memory leak happening in your code. In those cases `free_gpu_cache` will not help in any way. Please provide the relevant code (i.e. your training loop) if you want us to dig further down in this. – Ivan Nov 15 '22 at 10:09
  • @Ivan This is the training code: https://github.com/justinpinkney/stable-diffusion/blob/main/main.py I am posting the repo because the code exceeds the length limit of a Stack Overflow post. – Onur-Andros Ozbek Nov 15 '22 at 10:29
  • Why do you assume you have enough memory to run this model? – Ivan Nov 15 '22 at 10:59
  • @Ivan Check out my nvidia-smi – Onur-Andros Ozbek Nov 15 '22 at 17:35
  • Yes, but it all depends on when you are actually calling the command. There might be a point at the beginning (oversized data) or during training (leak) when the GPU goes out of memory... Do you see what I mean? – Ivan Nov 16 '22 at 08:43

0 Answers