I'm trying to load the whisper large-v2 model onto a GPU, but it seems that PyTorch first unpickles the whole model into CPU RAM, using more than 10 GB of memory, and only then moves it into GPU memory.
PyTorch's torch.load documentation also says that:
torch.load() uses Python’s unpickling facilities but treats storages, which underlie tensors, specially. They are first deserialized on the CPU and are then moved to the device they were saved from.
So the unpickling really does happen on the CPU.
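Here's a minimal sketch of how that host-side cost can be measured (this assumes a Linux box and the same checkpoint file as in the snippet below; resource.getrusage reports the peak resident set size, so it catches memory that is used during the call even if it is freed again afterwards):

import resource
import torch

# Peak resident set size of this process so far (kilobytes on Linux).
peak_before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

# Even though the target device is CUDA, the storages are first deserialized
# on the CPU (as the docs say), so host RAM is consumed while the checkpoint
# is read, even though none of the tensors end up staying there.
checkpoint = torch.load("large-v2.pt", map_location="cuda")

peak_after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak host RSS grew by ~{(peak_after - peak_before) / 1e6:.1f} GB")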
Because this model will be running in the cloud, it seems wrong to pay for a VM with extra RAM just so the model can be loaded into CPU RAM and then moved to the GPU. Is there a way to load it directly into the GPU without first loading the whole model into CPU RAM?
I'm currently using whisper's load_model function, which basically does this:
import torch
from whisper import Whisper, ModelDimensions

checkpoint_file = "large-v2.pt"

# torch.load unpickles the checkpoint; the tensor storages are materialized
# on the CPU before being moved to the device given by map_location.
with open(checkpoint_file, "rb") as fp:
    checkpoint = torch.load(fp, map_location="cuda")
del checkpoint_file

dims = ModelDimensions(**checkpoint["dims"])
model = Whisper(dims)
model.load_state_dict(checkpoint["model_state_dict"])
model.to("cuda")
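If it helps frame an answer: newer PyTorch releases added an mmap argument to torch.load, and something along these lines is the kind of approach I'm hoping for, though I haven't verified that it actually avoids the CPU RAM spike for this checkpoint:

import torch

# mmap=True memory-maps the checkpoint file instead of reading it fully into
# a private CPU buffer. I'm not sure whether this avoids the >10 GB resident
# spike before the tensors reach the GPU, or how well it combines with
# map_location="cuda", so treat this as a sketch rather than a solution.
checkpoint = torch.load("large-v2.pt", map_location="cuda", mmap=True)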