I'm trying to run some PyTorch models on my Jetson Nano (4 GB RAM), but I've learned that PyTorch uses about 2 GB of RAM just to initialize anything CUDA-related.
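
To see just the CUDA-context cost in isolation, a quick check of peak RSS (using only the standard-library resource module) is enough; this is a rough sketch and the exact numbers will vary between PyTorch builds:

import resource
import torch

def rss_mib():
    # on Linux, ru_maxrss is reported in kilobytes
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

print(f'before first CUDA allocation: {rss_mib():.0f} MiB')
x = torch.zeros(1, device='cuda')  # first CUDA tensor: creates the context and loads the kernels
print(f'after first CUDA allocation:  {rss_mib():.0f} MiB')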

I've done some testing (with the help of this GitHub issue), and got the following script running:

import torch
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('size', type=int)
parser.add_argument('--cpu', action='store_true')
args = parser.parse_args()

@profile
def f():
    torch.set_grad_enabled(False)
    torch.cuda._lazy_init()
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    if args.cpu:
        device = 'cpu'
    model = torch.nn.Conv2d(1, 1, 1).to(device)
    x = torch.rand(1, 1, args.size, args.size).to(device)
    y = model(x)

if __name__ == '__main__':
    f()

This can be run with python3 -m memory_profiler torchmemscript.py 100. Here is the output:

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     9  150.906 MiB  150.906 MiB           1   @profile
    10                                         def f():
    11  150.906 MiB    0.000 MiB           1       torch.set_grad_enabled(False)
    12  155.336 MiB    4.430 MiB           1       torch.cuda._lazy_init()
    13  155.336 MiB    0.000 MiB           1       device = 'cuda' if torch.cuda.is_available() else 'cpu'
    14  155.336 MiB    0.000 MiB           1       if args.cpu:
    15                                                 device = 'cpu'
    16 1889.699 MiB 1734.363 MiB           1       model = torch.nn.Conv2d(1, 1, 1).to(device)
    17 1890.414 MiB    0.715 MiB           1       x = torch.rand(1, 1, args.size, args.size).to(device)
    18 2634.496 MiB  744.082 MiB           1       y = model(x)

So loading the model onto the GPU clearly uses about 1.7 GB of RAM on my Jetson Nano. Running the same script with the --cpu option gives:

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     9  151.055 MiB  151.055 MiB           1   @profile
    10                                         def f():
    11  151.055 MiB    0.000 MiB           1       torch.set_grad_enabled(False)
    12  155.359 MiB    4.305 MiB           1       torch.cuda._lazy_init()
    13  155.359 MiB    0.000 MiB           1       device = 'cuda' if torch.cuda.is_available() else 'cpu'
    14  155.359 MiB    0.000 MiB           1       if args.cpu:
    15  155.359 MiB    0.000 MiB           1           device = 'cpu'
    16  157.754 MiB    2.395 MiB           1       model = torch.nn.Conv2d(1, 1, 1).to(device)
    17  157.754 MiB    0.000 MiB           1       x = torch.rand(1, 1, args.size, args.size).to(device)
    18  160.051 MiB    2.297 MiB           1       y = model(x)

Is there a way to reduce this overhead? The GitHub issue mentions compiling PyTorch without all the CUDA kernels, but I'm unsure which compile options I would need and which ones actually reduce the RAM overhead.
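
For what it's worth, this is how I check which GPU architectures the installed build ships kernels for, versus the single one the Nano needs (I believe the Nano's Maxwell GPU is compute capability 5.3; torch.cuda.get_arch_list() is only available in reasonably recent PyTorch versions):

import torch

# architectures the installed build was compiled for, e.g. ['sm_53', 'sm_62', 'sm_72', ...]
print(torch.cuda.get_arch_list())
# compute capability of the Nano's GPU, e.g. (5, 3)
print(torch.cuda.get_device_capability(0))

If that list is much longer than ['sm_53'], I assume a lot of kernel code is being loaded that this board can never use, but I don't know how much of the 1.7 GB that accounts for.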

Is there a known way to reduce the RAM usage by PyTorch?
