I am trying to fine-tune the TheBloke/Llama-2-13B-chat-GPTQ model using the Hugging Face Transformers library. I am using JSON files for the training and validation datasets. However, I am encountering an error related to the Exllama backend when I try to run the script.

Here is my code:

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from datasets import load_dataset
import torch

# Check GPU availability
print("Available GPU devices:", torch.cuda.device_count())
print("Name of the first available GPU:", torch.cuda.get_device_name(0))

# Load model and tokenizer
model_name = "TheBloke/Llama-2-13B-chat-GPTQ"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Move the model to GPU
model.to('cuda')

# Load training and validation data
train_data = load_dataset('json', data_files='train_data.jsonl')
val_data = load_dataset('json', data_files='val_data.jsonl')

# Function to format the data
def formatting_func(example):
    return tokenizer(example['input'], example.get('output', ''), truncation=True, padding='max_length')

# Prepare training and validation data
train_data = train_data.map(formatting_func)
val_data = val_data.map(formatting_func)

# Set training arguments
training_args = TrainingArguments(
    output_dir="./output",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    save_steps=10_000,
    save_total_limit=2,
)

# Create trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=val_data,
)

# Start training
trainer.train()

# Save the model
model.save_pretrained("./output")

The error message I get is:

ValueError: Found modules on cpu/disk. Using Exllama backend requires all the modules to be on GPU. You can deactivate exllama backend by setting disable_exllama=True in the quantization config object.

I have already moved the model to GPU using model.to('cuda'), but the error persists. Any help would be greatly appreciated.

I tried moving the model to the GPU using model.to('cuda') before initiating the training process, as suggested in the Hugging Face documentation. I also ensured that my environment has all the required packages and dependencies installed. I was expecting the model to fine-tune on my custom JSON dataset without any issues.

However, despite moving the model to the GPU, I still encounter the Exllama backend error. I am not sure why this is happening, as the model should be on the GPU as per my code. I am looking for a way to resolve this error and successfully fine-tune the model on my custom dataset.

1 Answer

Based on the error message, some of the model's quantized modules are still sitting on the CPU (or offloaded to disk) at the point where the Exllama kernels are set up. Exllama is the fast kernel backend used for GPTQ-quantized weights, and it requires every quantized module to be on the GPU. This check runs while the model is being loaded and dispatched, so a model.to('cuda') call issued afterwards comes too late to satisfy it; it typically fails when a GPTQ model is loaded without a device_map, or when accelerate offloads part of it to CPU or disk.

Here are some steps you could take to troubleshoot:

  1. Check the Device Allocation: Ensure that all components that interact with the model, such as optimizers or additional layers, are also moved to the GPU. Use model.parameters() and loop through them to confirm their device location.

    for param in model.parameters():
        print(param.device)
    
  2. Data Pipeline: Ensure that your data is also being loaded onto the GPU. If you're using a DataLoader for example, check if the data is being loaded onto the GPU as well.

    for batch in dataloader:
        # Move batch to the same device as the model
        batch = {k: v.to('cuda') for k, v in batch.items()}
    
  3. Disable Exllama: As a fallback, you could disable the Exllama backend by setting disable_exllama=True in the quantization config object you pass to from_pretrained (see the loading sketch after this list). This mainly costs inference speed rather than accuracy, but it should allow your code to run without this specific error.

  4. Logs and Diagnostics: Sometimes libraries output logs that give a clue as to what exactly has not been moved to the GPU. Increase the logging verbosity of transformers to see if any additional information is revealed (a small diagnostic snippet is included at the end of this answer).

  5. Dependency Check: Ensure that the Exllama kernels are actually available in your environment; for GPTQ models loaded through transformers they are typically provided by the auto-gptq package (used via optimum). Make sure it is installed with CUDA support and that your CUDA toolkit and cuDNN library are compatible with your PyTorch build.

  6. Environment Variables: Occasionally, you may need to set specific environment variables to ensure that the GPU is used. This is typically documented in the library's manual.

  7. Consult Documentation or Community: Since you mentioned Hugging Face, they have an active forum where similar issues are discussed. You may find a solution there.

  8. Explicitly Move Sub-Modules: Sometimes, especially with complex models having sub-modules, a simple .to('cuda') call might not suffice. Try moving each sub-module to the GPU explicitly.

  9. PyTorch Version: Make sure that your PyTorch version is compatible with the Exllama backend. Sometimes backend features are tightly coupled with specific versions of the framework.
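
For reference, here is a minimal loading sketch that ties points 1 and 3 together. It is only a sketch under a few assumptions: the accelerate and auto-gptq packages are installed, and your transformers version still accepts the disable_exllama flag mentioned in the error message (newer releases rename it to use_exllama).

    from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

    model_name = "TheBloke/Llama-2-13B-chat-GPTQ"
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Option A: keep Exllama and let accelerate place every quantized module
    # on the GPU at load time, instead of calling model.to('cuda') afterwards.
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",  # requires the accelerate package
    )

    # Option B: disable the Exllama kernels, as the error message suggests.
    # (Flag name is version-dependent; newer transformers use use_exllama=False.)
    quant_config = GPTQConfig(bits=4, disable_exllama=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",
        quantization_config=quant_config,
    )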

By systematically checking each of these factors, you should be able to identify the cause of the problem and take steps to resolve it.
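
And a small diagnostic snippet for points 4, 5 and 9, assuming a standard PyTorch/transformers install (adjust to your environment):

    import torch
    import transformers

    # Confirm the versions in play and that PyTorch actually sees the GPU
    # before the model is loaded.
    print("transformers:", transformers.__version__)
    print("torch:", torch.__version__, "| CUDA build:", torch.version.cuda)
    print("CUDA available:", torch.cuda.is_available())

    # Ask transformers to log more detail while the model is loaded and dispatched.
    transformers.logging.set_verbosity_debug()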