
I am having issues clearing out GPU memory after loading the Llama 2 model into a pipeline.

Clearing out GPU memory works fine with other models (i.e. deleting the variables and calling torch.cuda.empty_cache()), but it doesn't seem to work when using the Llama 2 model.
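For reference, this is roughly the pattern that works for me with other models (gpt2 below is just a stand-in model, and I'm checking the default CUDA device with torch.cuda.memory_allocated()):

import gc
import torch
from transformers import AutoModelForCausalLM

## load some other causal LM onto the GPU (gpt2 used purely as an example)
other_model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.float16, device_map="auto")
print(torch.cuda.memory_allocated())  # non-zero once the weights are on the GPU

## drop the reference, collect garbage, and release the cached blocks
del other_model
gc.collect()
torch.cuda.empty_cache()
print(torch.cuda.memory_allocated())  # drops back to (near) zero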

I tested this on my Ubuntu 22 PC and also on Google Colab with a GPU, and the behaviour is consistent. If I instantiate the tokenizer and model and subsequently delete them, the GPU memory is cleared. But if I instantiate the pipeline as well and subsequently delete it, the GPU memory is not released. Sample code below:

from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
import torch
import gc

model_id = "meta-llama/Llama-2-7b-chat-hf"

model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

## if the pipeline is not instantiated, GPU memory is released when the model is deleted
## if the pipeline is instantiated, deleting it does not release GPU memory!
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer)

## clearing out GPU memory
del model
del tokenizer
del pipe
gc.collect()
torch.cuda.empty_cache()
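And this is how I'm checking whether the memory actually comes back after the cleanup above (assuming a single CUDA device; dividing by 1e9 is just for readability):

## how much memory is still held after the cleanup
print(f"allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1e9:.2f} GB")

With the pipeline instantiated and then deleted, these stay at roughly the size of the model; without the pipeline, they drop back to (near) zero after the same cleanup.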

Has anyone else experienced this? Any insight or guidance would be very much appreciated.

Thanks

malaccan