
I'm trying to run a test code on GPU of a remote machine. The code is

import torch

foo = torch.tensor([1,2,3])
foo = foo.to('cuda')

I'm getting the following error

Traceback (most recent call last):
  File "/remote/blade/test.py", line 3, in <module>
    foo = foo.to('cuda')
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

From this discussion, a conflict between the CUDA and PyTorch versions may be the cause of the error. I ran the following

import sys
import torch

print('python v. : ', sys.version)
print('pytorch v. :', torch.__version__)
print('cuda v. :', torch.version.cuda)

to get the versions:

python v. : 3.9.7 (default, Sep 16 2021, 13:09:58)
[GCC 7.5.0]
pytorch v. : 1.11.0.dev20211206
cuda v. : 10.2

Does anything here look off?
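One more check worth running (a sketch; the import guard is only there so the snippet runs even on machines where torch is absent): `torch.version.cuda` records the CUDA version the wheel was *built* against, so it is worth asking the runtime directly whether the GPU is usable.

```python
import importlib.util

# torch.version.cuda is baked in at build time; whether CUDA is
# actually usable at runtime is a separate question.
if importlib.util.find_spec('torch') is None:
    print('torch is not installed')
else:
    import torch
    print('cuda available :', torch.cuda.is_available())
    print('built for CUDA :', torch.version.cuda)
```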

Blade
  • The Pytorch build you are using requires CUDA 10.2 – talonmies Dec 07 '21 at 21:03
  • @talonmies Is this a question or a statement? b/c you see that I have tried w/ 10.2 with no luck. If this is a question, I have no preference on cuda version. PS why you think this question should not have cuda tag? – Blade Dec 07 '21 at 21:06
  • 1
    It is statement. `torch.version.cuda` is a hard coded string which emitted by the Pytorch build. It must match a set of runtime libraries accessible in the default library search path. And your PyTorch problems aren’t a CUDA programming related question, which is why I have removed the tag – talonmies Dec 07 '21 at 21:10
  • Thanks for clarifying. So I removed the "EDIT:" section. Still, the problem remains with pytorch v. : 1.11.0.dev20211206 and cuda v. : 10.2. Is there anything else that I can check? – Blade Dec 07 '21 at 21:16
  • @Blade Did you solve it? – Minions Jan 19 '22 at 17:07
  • @Minions I added an answer for you. Hope this helps. – Blade Jan 19 '22 at 17:21
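talonmies's point about runtime libraries on the default search path can be probed from Python's standard library (a sketch; the names `cudart` and `cuda` are assumptions about how the CUDA runtime libraries are typically named, and `find_library` returns `None` when the loader cannot see a matching library):

```python
import ctypes.util

# Look for CUDA runtime libraries on the default library search
# path; None means the dynamic loader cannot find that library.
for name in ('cudart', 'cuda'):
    print(f'lib{name}:', ctypes.util.find_library(name))
```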

1 Answer


To answer the comments that asked if I was able to address the issue:

I had this issue on two separate occasions:

  1. The first time, I was trying to use conda libraries while I also had Python packages in another directory (probably installed using pip). I ended up manually removing the other library.

  2. The second time, the issue arose from zombie processes: I had prematurely terminated the code, so the GPU memory was never freed. The solution was to run

    ps -elf | grep python
    

    and then kill the processes using

    kill -9 [pid]
    

    where [pid] is the process id returned after the first command.
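The two commands above can be combined into a small helper (a sketch; the column position assumes the standard `ps -elf` layout, where the PID is the fourth whitespace-separated field, and the `ps` listing shown is hypothetical):

```python
def python_pids(ps_output: str) -> list[int]:
    """Extract PIDs of python processes from `ps -elf` output.

    In `ps -elf` the PID is the fourth whitespace-separated column;
    the `grep` line from the pipeline itself is skipped.
    """
    pids = []
    for line in ps_output.splitlines():
        cols = line.split()
        if len(cols) > 3 and 'python' in line and 'grep' not in line:
            pids.append(int(cols[3]))
    return pids

# Hypothetical ps listing: only the python process is reported,
# ready to be passed to `kill -9`.
sample = (
    "4 S user  4321  1200  0 80 0 - 12345 -  10:00 ?  00:00:05 python train.py\n"
    "0 S user  9876  1200  0 80 0 -  2345 -  10:01 ?  00:00:00 bash\n"
)
print(python_pids(sample))  # → [4321]
```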

Blade
  • Thanks @Blade! so you just killed all the python processes? – Minions Jan 19 '22 at 17:30
  • @Minions Yeah. Basically, I was not running anything but it was showing python processes, that's why they call them zombie processes. They just occupied memory. – Blade Jan 19 '22 at 17:34
  • 1
    Ah ok. In my case I don't have any running processes, but thanks for your answer :) – Minions Jan 19 '22 at 17:47