
I'm trying to run a test code on GPU of a remote machine. The code is

import torch

foo = torch.tensor([1,2,3])
foo = foo.to('cuda')

I'm getting the following error

Traceback (most recent call last):
  File "/remote/blade/test.py", line 3, in <module>
    foo = foo.to('cuda')
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

From this discussion, a conflict between the CUDA and PyTorch versions may be the cause of the error. I ran the following

import sys
import torch

print('python v. : ', sys.version)
print('pytorch v. :', torch.__version__)
print('cuda v. :', torch.version.cuda)

to get the versions:

python v. : 3.9.7 (default, Sep 16 2021, 13:09:58)
[GCC 7.5.0]
pytorch v. : 1.11.0.dev20211206
cuda v. : 10.2

Does anything here look off?
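One more check worth running (a sketch; the import guard is only there so the snippet runs even on machines where torch is absent): `torch.version.cuda` records the CUDA version the wheel was *built* against, so it is worth asking the runtime directly whether the GPU is usable.

```python
import importlib.util

# torch.version.cuda is baked in at build time; whether CUDA is
# actually usable at runtime is a separate question.
if importlib.util.find_spec('torch') is None:
    print('torch is not installed')
else:
    import torch
    print('cuda available :', torch.cuda.is_available())
    print('built for CUDA :', torch.version.cuda)
```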

Blade
  • The Pytorch build you are using requires CUDA 10.2 – talonmies Dec 07 '21 at 21:03
  • @talonmies Is this a question or a statement? b/c you see that I have tried w/ 10.2 with no luck. If this is a question, I have no preference on cuda version. PS why you think this question should not have cuda tag? – Blade Dec 07 '21 at 21:06
  • 1
    It is statement. `torch.version.cuda` is a hard coded string which emitted by the Pytorch build. It must match a set of runtime libraries accessible in the default library search path. And your PyTorch problems aren’t a CUDA programming related question, which is why I have removed the tag – talonmies Dec 07 '21 at 21:10
  • Thanks for clarifying. So I removed the "EDIT:" section. Still, the problem remains with pytorch v. : 1.11.0.dev20211206 and cuda v. : 10.2. Is there anything else that I can check? – Blade Dec 07 '21 at 21:16
  • @Blade Did you solve it? – Minions Jan 19 '22 at 17:07
  • @Minions I added an answer for you. Hope this helps. – Blade Jan 19 '22 at 17:21
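talonmies's point about runtime libraries on the default search path can be probed from Python's standard library (a sketch; the names `cudart` and `cuda` are assumptions about how the CUDA runtime libraries are typically named, and `find_library` returns `None` when the loader cannot see a matching library):

```python
import ctypes.util

# Look for CUDA runtime libraries on the default library search
# path; None means the dynamic loader cannot find that library.
for name in ('cudart', 'cuda'):
    print(f'lib{name}:', ctypes.util.find_library(name))
```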

1 Answer


To answer the comments that asked if I was able to address the issue:

I had this issue on two separate occasions:

  1. The first time, I was trying to use conda libraries while I also had Python packages in another directory (probably installed using pip). I ended up manually removing the other library.

  2. The second time, the issue arose from zombie processes: I had prematurely terminated the code, so the GPU memory was never freed. The solution was to run

    ps -elf | grep python
    

    and then kill the processes using

    kill -9 [pid]
    

    where [pid] is the process id returned after the first command.
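The two commands above can be combined into a small helper (a sketch; the column position assumes the standard `ps -elf` layout, where the PID is the fourth whitespace-separated field, and the `ps` listing shown is hypothetical):

```python
def python_pids(ps_output: str) -> list[int]:
    """Extract PIDs of python processes from `ps -elf` output.

    In `ps -elf` the PID is the fourth whitespace-separated column;
    the `grep` line from the pipeline itself is skipped.
    """
    pids = []
    for line in ps_output.splitlines():
        cols = line.split()
        if len(cols) > 3 and 'python' in line and 'grep' not in line:
            pids.append(int(cols[3]))
    return pids

# Hypothetical ps listing: only the python process is reported,
# ready to be passed to `kill -9`.
sample = (
    "4 S user  4321  1200  0 80 0 - 12345 -  10:00 ?  00:00:05 python train.py\n"
    "0 S user  9876  1200  0 80 0 -  2345 -  10:01 ?  00:00:00 bash\n"
)
print(python_pids(sample))  # → [4321]
```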

Blade
  • Thanks @Blade! so you just killed all the python processes? – Minions Jan 19 '22 at 17:30
  • @Minions Yeah. Basically, I was not running anything but it was showing python processes, that's why they call them zombie processes. They just occupied memory. – Blade Jan 19 '22 at 17:34
  • 1
    Ah ok. In my case I don't have any running processes, but thanks for your answer :) – Minions Jan 19 '22 at 17:47