I'm playing around with the new Llama-2 7B model, running it on an M1 Pro Mac with 16 GB of RAM. If I load the model, Python crashes with a memory error - unless I load it via hf pipelines. I don't believe this is an hf issue but rather something weird on my machine; I'm not sure what I'm doing wrong. I have also tried downloading the weights and running the model locally - same error.
If I load the model via hf pipelines, such as:
from transformers import AutoTokenizer
import transformers
import torch

model = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model)

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
)

sequences = pipeline(
    "What's 1+1?",
    do_sample=False,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=2000,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
That works fine - it's quite slow, but it runs.
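(As an aside, I assume I could also ask the pipeline to load the weights in half precision to roughly halve the footprint - something like the sketch below, where torch_dtype is just forwarded to the underlying from_pretrained call. I haven't verified that it changes anything on my machine, so treat it as a guess.)

import transformers
import torch

model = "meta-llama/Llama-2-7b-chat-hf"

# Same pipeline as above, but requesting fp16 weights.
# torch_dtype is passed through to from_pretrained; whether it
# actually avoids the crash here is untested.
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
)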
But, if I try to load the model in any other way, such as:
from ctransformers import AutoModelForCausalLM
llm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", model_type='llama')
or
from langchain.llms import CTransformers

llm = CTransformers(
    model='meta-llama/Llama-2-7b-chat-hf',
    model_type='llama',
    config={'max_new_tokens': 256,
            'temperature': 0.01},
)
Python crashes and I get a warning like

UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown

which apparently means I'm out of memory.
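For scale, these are my rough back-of-the-envelope numbers for the weights alone (parameter count times bytes per parameter, ignoring activations and any other overhead):

# Approximate memory needed just to hold 7B parameters in RAM.
params = 7e9
for name, bytes_per_param in [("fp32", 4), ("fp16", 2), ("4-bit", 0.5)]:
    print(f"{name}: ~{params * bytes_per_param / 1e9:.0f} GB")
# fp32: ~28 GB, fp16: ~14 GB, 4-bit: ~4 GB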
Fine - but a) I'm shutting down everything else, so I should have enough RAM to run the model locally on CPU, and b) why can I load the model via hf pipelines? Any pointers appreciated.
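PS: one thing I noticed while reading around is that the ctransformers examples I've seen point at quantized GGML files rather than at the full hf repo. I have no idea whether that's related to my crash, but for reference this is the pattern - the repo and file names below are just the example I remember, so treat them as placeholders:

from ctransformers import AutoModelForCausalLM

# Pattern from the ctransformers examples I've seen: a quantized GGML file
# from a community repo instead of the original meta-llama weights.
# Repo/file names are from memory and may not be exact.
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-Chat-GGML",
    model_file="llama-2-7b-chat.ggmlv3.q4_0.bin",
    model_type="llama",
)
print(llm("What's 1+1?"))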