
I have fine-tuned a Llama model with low-rank adaptation (LoRA) using the peft package. The resulting files adapter_config.json and adapter_model.bin have been saved.

I can load the fine-tuned model from the Hugging Face Hub with the following code:

import torch
from transformers import LlamaForCausalLM
from peft import PeftModelForCausalLM

model = LlamaForCausalLM.from_pretrained(<model_name>,
                                         torch_dtype=torch.float16,
                                         device_map='auto',
                                         llm_int8_enable_fp32_cpu_offload=True
                                         )
peft_model_id = <hub_model_name>
peft_model = PeftModelForCausalLM.from_pretrained(model, peft_model_id)

If I want to load the fine-tuned model directly from the local files adapter_config.json and adapter_model.bin (instead of pushing them to the Hub), how can I do that?

Thanks in advance!

1 Answer


It is fairly similar to how you set it up for models from the Hugging Face Hub. The main step is to get the path to the original base model, which you can do by creating a PeftConfig object from the local path to your fine-tuned PEFT model (the folder containing adapter_config.json and the fine-tuned weights). If you used any kind of quantization during fine-tuning, such as 4-bit training with bitsandbytes, you will also want to set up a quantization config object and pass it to the base model. Here is an example script:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftConfig, PeftModel

# Set the path to the local folder that contains adapter_config.json and the adapter .bin files for the PEFT model
peft_model_id = '/path/to/local/peft_model_folder'

# Get the PeftConfig from the fine-tuned PEFT model. This config contains the path to the base model
config = PeftConfig.from_pretrained(peft_model_id)

# If you quantized the model while finetuning using bits and bytes 
# and want to load the model in 4bit for inference use the following code.
# NOTE: Make sure the quant and compute types match what you did during finetuning
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load the base model - if you are not using the bnb_config then remove the quantization_config argument.
# You may or may not need to set use_auth_token=True depending on your model.
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    quantization_config=bnb_config,
    use_auth_token=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
tokenizer.pad_token = tokenizer.eos_token

# Load the Peft/Lora model
model = PeftModel.from_pretrained(model, peft_model_id)
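
As a quick sanity check that the adapter actually loaded, you can run a short generation pass. This is just a minimal sketch; the prompt string and generation settings below are arbitrary examples:

# Put the model in eval mode and generate a few tokens to verify the setup
model.eval()
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))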