
I'm playing around with the new Llama-2 7B model, running it on a 16GB RAM M1 Pro Mac. If I load the model directly, Python crashes with a memory error - unless I load it via hf pipelines. I don't believe this is an hf issue but rather something weird with my machine, but I'm not sure what I'm doing wrong. I have also tried downloading the weights and running the model locally - same error.
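
For reference, by "loading the model directly" I mean roughly the standard transformers call below (reproduced from memory, so the exact arguments might differ slightly from what I ran):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# this is the call that crashes for me with a memory error
model = AutoModelForCausalLM.from_pretrained(model_id)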

If I load the model via hf pipelines, such as:

from transformers import AutoTokenizer
import transformers
import torch

model = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model)

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
)

sequences = pipeline(
    "What's 1+1?",
    do_sample=False,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=2000,
)

for seq in sequences:
    print(f"Result: {seq['generated_text']}")

that works fine - and although it's quite slow, I can run it.

But, if I try to load the model in any other way, such as:

from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", model_type='llama')

or

from langchain.llms import CTransformers
llm = CTransformers(
    model='meta-llama/Llama-2-7b-chat-hf',
    model_type='llama',
    config={'max_new_tokens': 256,
            'temperature': 0.01})

Python crashes and I get a warning like UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown, which apparently means that I'm out of memory.
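
To double-check that it really is memory, I figure I can watch free RAM right before the load - a minimal sketch, assuming psutil is installed (this is just my guess at the simplest way to check, not something from the libraries above):

import psutil

# print available system memory just before the from_pretrained call
print(f"Available RAM: {psutil.virtual_memory().available / 1e9:.1f} GB")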

Fine - but a) I'm shutting down everything else, so I should have enough RAM on my machine to run the model locally on CPU, and b) why can I load the model via hf pipelines? Any pointers appreciated.
