
I have 2 GPUs on different computers. One (NVIDIA A100) is in a server, the other (NVIDIA Quadro RTX 3000) is in my laptop. I monitor performance on both machines via nvidia-smi and noticed that the two GPUs use different amounts of memory when running the exact same process (same code, same data, same CUDA version, same PyTorch version, same drivers). I created a dummy script to verify this.

import torch
device = torch.device("cuda:0")
# 10,000 x 10,000 float64 values = 800,000,000 bytes
a = torch.ones((10000, 10000), dtype=float).to(device)

In nvidia-smi I can see how much memory is used by this specific Python script:

  • A100: 1205 MiB
  • RTX 3000: 1651 MiB
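
For reference, the numbers above come from nvidia-smi's per-process table. A small sketch that queries the same figure programmatically (assuming the used_gpu_memory field is available on both driver versions):

import subprocess

# Ask nvidia-smi for per-process GPU memory, the same breakdown shown in its
# process table (the used_gpu_memory field name is assumed to exist on both drivers).
result = subprocess.run(
    ["nvidia-smi",
     "--query-compute-apps=pid,process_name,used_gpu_memory",
     "--format=csv"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)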

However, when I query torch about memory usage I get the same values for both GPUs:

reserved = torch.cuda.memory_reserved(0)
allocated = torch.cuda.memory_allocated(0)

Both systems report the same usage:

  • reserved = 801112064 bytes (764 MiB)
  • allocated = 800000000 bytes (763 MiB)

I note that the allocated amount is much less than what I see used in nvidia-smi, though 763 MiB is exactly the size of the tensor: 100e6 float64 values × 8 bytes = 800,000,000 bytes.
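
To put the allocator-level numbers next to a driver-level number closer to what nvidia-smi shows, a sketch like the following can be used (assuming a PyTorch version that exposes torch.cuda.mem_get_info); anything outside PyTorch's allocator, such as the CUDA context, only appears in the driver-level view:

import torch

device = torch.device("cuda:0")
a = torch.ones((10000, 10000), dtype=float).to(device)

# Size the tensor itself needs: 1e8 float64 values * 8 bytes = 800,000,000 bytes.
expected = a.nelement() * a.element_size()

# Allocator-level view: what PyTorch hands to tensors / keeps cached.
allocated = torch.cuda.memory_allocated(device)
reserved = torch.cuda.memory_reserved(device)

# Driver-level view: what the whole device reports (closer to nvidia-smi; it
# includes the CUDA context and any other processes using the GPU).
free, total = torch.cuda.mem_get_info(device)
used_on_device = total - free

print(f"expected   {expected}")
print(f"allocated  {allocated}")
print(f"reserved   {reserved}")
print(f"device use {used_on_device}")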

Why does nvidia-smi report different memory usage on these 2 systems?

tnknepp
  • is the Quadro rendering your OS? – Matthew Sep 14 '22 at 15:42
  • @Matthew Yes, this is the only card on the laptop, so it is handling everything. However, nvidia-smi breaks down memory usage by each process so I can see how much is used by the python code. I edited my question to reflect that detail. – tnknepp Sep 14 '22 at 15:47
  • Rerun the example with the environment variable `PYTORCH_NO_CUDA_MEMORY_CACHING=1`. PyTorch will over-allocate space and do its own memory management by default – Carson Sep 14 '22 at 21:48
  • @Carson Sorry, I don't understand how that is set. Also, I think your suggestion will only affect the memory allocation within PyTorch, and the allocated/reserved amounts are already the same across computers. The issue may be in how nvidia-smi is reporting memory usage. – tnknepp Sep 15 '22 at 16:00
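
Regarding the comment above, a minimal sketch of one way to set that environment variable (set here before importing torch so it is picked up when CUDA initialises; it can equally be exported in the shell before launching the script):

import os
# Disable PyTorch's caching allocator, as suggested in the comment above.
os.environ["PYTORCH_NO_CUDA_MEMORY_CACHING"] = "1"

import torch
device = torch.device("cuda:0")
a = torch.ones((10000, 10000), dtype=float).to(device)
print(torch.cuda.memory_reserved(0), torch.cuda.memory_allocated(0))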

0 Answers