
Consider the following script:

import torch

def unnecessary_compute():
    x = torch.randn(1000,1000, device='cuda')
    l = []
    for i in range(5):
        print(i,torch.cuda.memory_allocated())
        l.append(x**i)
unnecessary_compute()

Running this script with PyTorch (1.11) generates the following output:

0 4000256
1 8000512
2 12000768
3 16001024
4 20971520

Given that PyTorch uses asynchronous computation and we never evaluated the contents of l or of a tensor that depends on l, why did PyTorch eagerly allocate GPU memory to the new tensors? Is there a way of invoking these tensors in an utterly lazy way (i.e., without triggering GPU memory allocation before it is required)?

Trisoloriansunscreen
  • Are you sure that it shows the GPU memory? Try to ask specifically for the GPU memory via torch.cuda.memory_allocated(device='cuda'). Otherwise the allocated memory of the current device is shown, whatever that is in your case. – tschomacker Jul 19 '22 at 09:37
  • Thank you @tschomacker. I just reran it with torch.cuda.memory_allocated(device='cuda'), the output is the same. – Trisoloriansunscreen Jul 19 '22 at 12:29
  • The linked PyTorch doc just says the operations **might** not be executed when the function returns. – hkchengrex Jul 24 '22 at 05:26
  • I don't get it, since you initialized the tensor on GPU, it should take some memory to store. If you want something as "appear when required", [yield](https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do) may help. – CuCaRot Jul 25 '22 at 07:04

2 Answers


torch.cuda.memory_allocated() returns the memory that has been allocated, not the memory that has been "used".
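
For example (a minimal sketch, assuming a fresh process with a CUDA device available), a tensor whose contents are never written or read still counts toward memory_allocated():

import torch

print(torch.cuda.memory_allocated())        # 0 in a fresh process
t = torch.empty(1000, 1000, device='cuda')  # reserves space; nothing is computed or copied
print(torch.cuda.memory_allocated())        # ~4 MB: allocated, even though never "used"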

In a typical GPU compute pipeline, you record operations in a queue along with whatever synchronization primitives your API offers. The GPU then dequeues and executes those operations, respecting the enqueued synchronization primitives. However, GPU memory allocation is usually not an operation that goes on the queue at all. Rather, the CPU issues a separate, fundamental instruction to the GPU to allocate memory, distinct from the instruction that records operations. This means that the memory necessary for a GPU operation has to be allocated before the operation has even been enqueued; there is no "allocate memory" operation in the queue to synchronize with.
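
As a rough illustration of this queueing model in PyTorch terms (a sketch, assuming a CUDA device is available; the stream and event here play the role of the queue and the synchronization primitive, and the sizes are arbitrary):

import torch

s = torch.cuda.Stream()      # a work queue ("stream" in CUDA terms)
done = torch.cuda.Event()    # a synchronization primitive

a = torch.randn(2048, 2048, device='cuda')
s.wait_stream(torch.cuda.current_stream())   # order s after the work above

with torch.cuda.stream(s):
    b = a @ a        # the matmul kernel is enqueued on s, but the buffer for b
                     # is allocated synchronously by this very call
    done.record(s)   # enqueue a marker after the matmul

# The CPU can keep doing other work here while the GPU drains the queue.
done.synchronize()                           # block until the marker is reached
torch.cuda.current_stream().wait_stream(s)   # re-join the default queue
print(b[0, 0].item())                        # safe to read the result now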

Consider Vulkan as a simple example. Rendering operations are enqueued on a graphics queue. However, memory is typically allocated via calls to vkAllocateMemory(), which does not accept any sort of queue at all; it only accepts the device handle and information about the allocation (size, memory type, etc). From my understanding, the allocation is done "immediately" / synchronously (the memory is safe to use by the time the function call returns on the CPU).

I don't know enough about GPUs to explain why this is the case, but I'm sure there's a good reason. And perhaps the limitations vary from device to device. But if I were to guess, memory allocation probably has to be a fairly centralized operation; it can't be done by just any core executing recorded operations on a queue. This would make sense, at least; the space of GPU memory is usually shared across cores.

Let's apply this knowledge to answer your question: when you call l.append(x**i), the expression x**i records a compute operation. That operation needs memory to store its result, so PyTorch is likely allocating that memory prior to enqueuing the operation. This explains the behavior you're seeing.

However, this doesn't invalidate PyTorch's claims about asynchronous compute. The memory might be allocated synchronously, but it won't be populated with the result of the operation until the operation has been dequeued and completed by the GPU, which indeed happens asynchronously.
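
A small experiment along these lines (a sketch, assuming a CUDA device; the sizes and the number of matmuls are arbitrary, and the exact figures will vary by hardware) separates the two halves: the result buffers appear in memory_allocated() as soon as the operations are recorded, while the host only waits for the actual computation at the synchronization point:

import time
import torch

a = torch.randn(2048, 2048, device='cuda')
torch.cuda.synchronize()                    # start from a quiet GPU

before = torch.cuda.memory_allocated()
results = [a @ a for _ in range(20)]        # enqueue 20 matmuls; returns almost immediately
print("bytes allocated right away:", torch.cuda.memory_allocated() - before)

start = time.time()
torch.cuda.synchronize()                    # now actually wait for the kernels
print("seconds spent waiting: %.3f" % (time.time() - start))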

Alexander Guyer
  • So is there any tool to actually get how much GPU memory is being used (as opposed to allocated) during training? – NeoZoom.lua Jan 18 '23 at 01:49
  • @VimNing Not to my knowledge. Unless it's tracked internally by PyTorch, it's impossible, and PyTorch doesn't expose any sort of interface to access that information, so I doubt it's even tracked. – Alexander Guyer Jan 18 '23 at 15:01

I was able to reproduce your problem. I cannot really tell you why it behaves like that; I just think the (randomly) initialized tensor needs a certain amount of memory. For instance, if you call x = torch.randn(0,0, device='cuda'), the tensor does not allocate any GPU memory, while x = torch.zeros(1000,1000, device='cuda') allocates 4000256 bytes, as in your example.
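
A quick way to check those two cases (a sketch, assuming a CUDA device; the exact byte count may vary slightly with the allocator's rounding):

import torch

e = torch.randn(0, 0, device='cuda')
print(torch.cuda.memory_allocated())     # 0: a zero-element tensor needs no storage
z = torch.zeros(1000, 1000, device='cuda')
print(torch.cuda.memory_allocated())     # 4000256: 4 MB rounded up by the caching allocator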

To load the tensors lazily, I suggest you create them on the CPU and move them to the GPU just before using them; a kind of speed/memory trade-off. I changed your code accordingly:

import torch

def unnecessary_compute():
    # Build the tensors on the CPU first; no GPU memory is touched here.
    x = torch.randn(1000, 1000, device='cpu')
    l = []
    for i in range(5):
        print(i, torch.cuda.memory_allocated())
        l.append(x**i)
    # Move the tensors to the GPU only when they are actually needed.
    print("Move to cuda")
    for i, tensor_x in enumerate(l):
        l[i] = tensor_x.to('cuda')
        print(i, torch.cuda.memory_allocated())

unnecessary_compute()

that produced the following output:

0 0
1 0
2 0
3 0
4 0
Move to cuda
0 4000256
1 8000512
2 12000768
3 16001024
4 20971520
tschomacker
  • I'm afraid that in your script, the tensors are never moved to the GPU since you do not collect the output of tensor_x.to('cuda'). If you'll change the middle loop to `for i, tensor_x in enumerate(l): l[i]=tensor_x.to('cuda')`, you'll see that the CUDA memory is allocated. – Trisoloriansunscreen Jul 22 '22 at 16:14
  • @Trisoloriansunscreen You are absolutely right and I changed my script accordingly. My suggestion still remains that you initialize tensors on the CPU and move them to CUDA when you need them. I do this frequently when I am tokenizing text input. Does this approach answer your question on how to invoke tensors lazily? Or are you looking for something different? – tschomacker Jul 23 '22 at 12:17
  • I'm afraid that this still doesn't achieve (automatic) lazy memory allocation. I'm considering a case in which a certain operation uses a subset of the tensors that were defined. I was assuming (incorrectly) that unevaluated parts of the graph will not be allocated at all, but it seems that this is not the case. – Trisoloriansunscreen Jul 23 '22 at 23:32
  • moving the tensors to cuda as you suggest is a potential solution, but it moves the burden of optimizing memory allocation to the programmer. – Trisoloriansunscreen Jul 23 '22 at 23:35
  • Ah ok, now I understand your problem more thoroughly. As other commenters pointed out, the behavior you describe is inherent in PyTorch, and unfortunately I do not see how your specific goal can be achieved. – tschomacker Jul 26 '22 at 09:59