
This question can be viewed as related to my other question.

I tried running multiple machine learning processes in parallel (launched from bash). They are written in PyTorch. After a certain number of concurrent programs (10 in my case), I get the following error:

RuntimeError: Unable to find a valid cuDNN algorithm to run convolution

As mentioned in this answer,

...it could occur because the VRAM memory limit was hit (which is rather non-intuitive from the error message).

For my case with PyTorch model training, decreasing batch size helped. You could try this or maybe decrease your model size to consume less VRAM.
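For reference, this is roughly what the quoted suggestion looks like in a minimal PyTorch training loop; the dataset, model, and batch size of 16 here are placeholders rather than my actual setup:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data and model -- substitute your own.
dataset = TensorDataset(torch.randn(1024, 3, 32, 32),
                        torch.randint(0, 10, (1024,)))
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                      nn.Linear(16, 10)).cuda()

# Dropping the batch size (e.g. 64 -> 16) shrinks the activations each
# forward/backward pass keeps in VRAM, and with it this process's footprint.
loader = DataLoader(dataset, batch_size=16, shuffle=True)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for x, y in loader:
    x, y = x.cuda(), y.cuda()
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
```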

I tried the solution mentioned here to enforce a per-process GPU memory usage limit, but the issue persists.
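In case it helps, this is the shape of the per-process cap I tried, assuming the linked solution is PyTorch's torch.cuda.set_per_process_memory_fraction; the 0.1 fraction and device index are only illustrative:

```python
import torch

# Cap this process's CUDA caching allocator at a fraction of total device
# memory (values are illustrative). Allocations beyond the cap raise an
# out-of-memory error in this process instead of growing without bound.
torch.cuda.set_per_process_memory_fraction(0.1, device=0)

# ... model construction and training as usual ...
```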

This problem does not occur with a single process, or with fewer processes. Since only one context runs at any given instant, why does this cause a memory issue?

The issue occurs both with and without MPS. I expected it might occur with MPS, but not otherwise, since MPS can run multiple processes in parallel.

muser
  • Yeah, if you ask for too much memory, a computer may crash. This is not GPU-specific; you could also try to allocate a 10000000 GB array on your CPU and make your code crash. What is your question? – Ander Biguri Nov 30 '22 at 17:48
  • @AnderBiguri As stated, the problem doesn't occur with a single process of the same nature, but with 10 processes running concurrently. Why does this occur, since the GPU runs only 1 process at a time? – muser Nov 30 '22 at 17:49
  • The GPU is a device purposely designed and built for parallel processing. Why do you think it only does one thing at a time? It will _compute_ one thing at a time only when that computation is bigger than its processing power, but that's it. Many processes can run on the GPU simultaneously; this is absolutely OK and expected (e.g. you may be running your display and compute at any time). Check `nvidia-smi` to see all your different processes running at the same time on the GPU. – Ander Biguri Nov 30 '22 at 17:50
  • @AnderBiguri By simultaneously, do you mean in parallel? I understand why display and compute *appear* to be happening in parallel, but they are happening sequentially. – muser Nov 30 '22 at 17:55
  • When the GPU is executing multiple processes (one after the other, for example by pre-emption), is the memory being utilized by multiple processes at the (exact) same time? Even by those that the GPU is not executing at the moment? – muser Nov 30 '22 at 17:57
  • But they don't allocate and deallocate memory sequentially. That would be wasteful. The _compute_ parts of the GPU calls are scheduled internally by the GPU, and they only run sequentially if each of them fills the compute power of the GPU; otherwise the scheduler will parallelize them into simultaneous streams. But if you launch 10 processes that have enough compute to fill the GPU, and each takes 1 GB of memory, you will be using 10 GB, not 1 GB each at a time. – Ander Biguri Nov 30 '22 at 17:58

1 Answer


Since only one context runs at any given instant, why does this cause a memory issue?

Context switching doesn't dump the contents of GPU "device" memory (i.e. DRAM) to some other location. If you run out of this device memory, context switching doesn't alleviate that.

If you run multiple processes, the memory used by each process will add up (just like it does in the CPU space) and GPU context switching (or MPS or time-slicing) does not alleviate that in any way.

It's completely expected that if you run enough processes using the GPU, eventually you will run out of resources. Neither GPU context switching nor MPS nor time-slicing in any way affects the memory utilization per process.
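For illustration, each process can query device memory itself and watch the combined usage grow as more processes start; a minimal sketch using PyTorch's built-in queries (not specific to the setup in the question):

```python
import os
import torch

torch.cuda.init()  # make sure a CUDA context exists on the default device

# Free/total device memory as reported by the driver; "free" reflects
# allocations from *all* processes sharing this GPU, not just this one.
free_bytes, total_bytes = torch.cuda.mem_get_info()

# Memory currently held by this process's PyTorch caching allocator.
reserved_bytes = torch.cuda.memory_reserved()

print(f"pid={os.getpid()}: "
      f"{free_bytes / 2**30:.2f} GiB free of {total_bytes / 2**30:.2f} GiB total; "
      f"this process has reserved {reserved_bytes / 2**30:.2f} GiB")
```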

Robert Crovella
  • As usual, Robert has been able to convey with better words what I meant in the comments ;). Thanks. – Ander Biguri Nov 30 '22 at 17:59
  • Thank you, that answers the question. Are you aware of any solutions to limit this usage (PyTorch- or TF-specific)? The ones I mentioned in the question don't appear to work. – muser Nov 30 '22 at 17:59
  • @abs Use less memory? Buy a bigger GPU? Make sure you read the available GPU specs and schedule accordingly? – Ander Biguri Nov 30 '22 at 18:00
  • @AnderBiguri Of course those are possible. I specifically asked for solutions to limit the usage. – muser Nov 30 '22 at 18:15
  • There are many, many questions here on SO that are PyTorch or TF specific and ask how to deal with GPU out-of-memory situations. I don't have any secrets to share beyond those. As a practical matter, my expectation is that well before you discovered how to go from running 10 training jobs at the same time to running 100 training jobs at the same time on the same GPU, you would run into other performance limits that would make the benefits of adding more jobs disappear. – Robert Crovella Nov 30 '22 at 18:19
  • Agree with Robert. If your single processes fill the GPU compute, call them sequentially, unless you have a serious memory transfer bottleneck problem. – Ander Biguri Nov 30 '22 at 18:21