
I use code along the following lines to run Llama 2.

from os.path import dirname
from transformers import LlamaForCausalLM, LlamaTokenizer
import torch 

model = "/Llama-2-70b-chat-hf/"
# model = "/Llama-2-7b-chat-hf/"

tokenizer = LlamaTokenizer.from_pretrained(dirname(model))  

model = LlamaForCausalLM.from_pretrained(dirname(model)) 

eval_prompt = """
Summarize this dialog:
A: Hi Tom, are you busy tomorrow’s afternoon?
B: I’m pretty sure I am. What’s up?
A: Can you go with me to the animal shelter?.
B: What do you want to do?
A: I want to get a puppy for my son.
B: That will make him so happy.
A: Yeah, we’ve discussed it many times. I think he’s ready now.
B: That’s good. Raising a dog is a tough issue. Like having a baby ;-) 
A: I'll get him one of those little dogs.
B: One that won't grow up too big;-)
A: And eat too much;-))
B: Do you know which one he would like?
A: Oh, yes, I took him there last Monday. He showed me one that he really liked.
B: I bet you had to drag him away.
A: He wanted to take it home right away ;-).
B: I wonder what he'll name it.
A: He said he’d name it after his dead hamster – Lemmy  - he's  a great Motorhead fan :-)))
---
Summary:
"""

model_input = tokenizer(eval_prompt, return_tensors="pt")   

model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=100)[0], skip_special_tokens=True))

The 7b version outputs an answer, but the 70b version loads the checkpoint shards and then fails with an error. The size mismatch lines below repeat many times (for different weights).

Loading checkpoint shards: 100%|███████████████████████████████████████████████| 15/15 [11:56<00:00, 47.78s/it]
Traceback (most recent call last):
  File "/llama2.py", line 52, in <module>
    model = LlamaForCausalLM.from_pretrained(dirname(model))
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/miniconda3/envs/llama2/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2795, in from_pretrained
    ) = cls._load_pretrained_model(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/miniconda3/envs/llama2/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3173, in _load_pretrained_model
    raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
    size mismatch for model.layers.0.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 8192]) from checkpoint, the shape in current model is torch.Size([8192, 8192]).
    size mismatch for model.layers.0.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 8192]) from checkpoint, the shape in current model is torch.Size([8192, 8192]).

You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.


With ignore_mismatched_sizes=True I get another error: KeyError: 'lm_head.weight'. But if it runs with the 7b model, why not with the 70b?
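
Roughly the call I tried after that hint (only the extra argument differs from the code above):

# Same load as before, with the flag suggested by the error message
model = LlamaForCausalLM.from_pretrained(dirname(model), ignore_mismatched_sizes=True)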

Edit: The RAM requirement is over 100 GB, but I have several times that much RAM. I have 12 MB of VRAM.

user14094230

2 Answers


Insufficient hardware

You didn't mention anything about the hardware you run it on, so I can only assume this is a classic case of insufficient hardware. As a rule of thumb, you need at least 1 GB of RAM (preferably VRAM, depending on the architecture) for every billion model parameters.

With a 70b model you should have 70 GB of VRAM (or unified RAM), which in practice usually means a 96 GB machine.
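
A quick back-of-the-envelope check of that rule of thumb, counting only the bytes needed to hold the weights (standard dtype sizes; activations and framework overhead come on top):

# Rough memory needed just to store 70B parameters at common precisions.
params = 70e9
for dtype, bytes_per_param in [("float32", 4), ("float16", 2), ("int8", 1)]:
    print(f"{dtype}: ~{params * bytes_per_param / 1e9:.0f} GB")
# float32: ~280 GB, float16: ~140 GB, int8: ~70 GB

The 1 GB per billion figure corresponds to the int8 row; a default float16 load needs roughly twice that, which lines up with the 100+ GB the question mentions.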

Nir O.
  • *1GB of RAM for every billion model parameters*, interesting. Any sources or hard-earned experience? :') – doneforaiur Aug 14 '23 at 13:44
  • I have hundreds of GB in total RAM. I'm not using a GPU, it's not sent to cuda. – user14094230 Aug 14 '23 at 14:08
  • @doneforaiur I saw the statement on Reddit and it matched exactly what I experienced with various configurations and models, and I tried many of them. – Nir O. Aug 14 '23 at 15:51

To run the Llama 2 model with quantization, you can use the following steps:

1. Install the quantization library. You can do this by following the instructions in the Hugging Face documentation: https://huggingface.co/docs/optimum/concept_guides/quantization.

2. Quantize the model. You can do this by running the following command: optimum-quantize --model lama2 --precision int8

3. Export the quantized model. You can do this by running the following command: optimum-export --model lama2-int8 --framework pytorch

4. Run the quantized model. You can do this by loading the model in PyTorch and then calling its forward() method.

Here is an example of how to run the quantized Llama 2 model:
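
A sketch of one possible way to do this, using 8-bit loading through transformers and bitsandbytes rather than the optimum commands above (it assumes a CUDA GPU plus the accelerate and bitsandbytes packages, and reuses the checkpoint path from the question):

# Sketch: load the checkpoint in 8-bit with bitsandbytes (needs a CUDA GPU,
# accelerate and bitsandbytes installed); not the optimum CLI route above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_path = "/Llama-2-70b-chat-hf"  # checkpoint directory from the question (assumed)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit weights
    device_map="auto",  # needs accelerate; places layers on the available GPU(s)
)

prompt = "Summarize this dialog:\nA: Hi Tom, are you busy tomorrow?\nB: I'm pretty sure I am. What's up?\n---\nSummary:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))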