How can I fix this?

I wrote code to train GPT-2 on a dataset from Hugging Face, but I get the following error and don't know why:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-7-b3178137a672> in <cell line: 17>()
     15 )
     16 
---> 17 trainer.train()
     18 model.save_pretrained('/content/drive/MyDrive/MyGPT')

11 frames
/usr/local/lib/python3.10/dist-packages/datasets/formatting/formatting.py in _check_valid_index_key(key, size)
    524     if isinstance(key, int):
    525         if (key < 0 and key + size < 0) or (key >= size):
--> 526             raise IndexError(f"Invalid key: {key} is out of bounds for size {size}")
    527         return
    528     elif isinstance(key, slice):

IndexError: Invalid key: 409862 is out of bounds for size 0

I'm using Hugging Face Transformers in Google Colab. Here is the code:

!pip install 'transformers[torch]'

!pip install datasets
from datasets import load_dataset

dataset = load_dataset("Nan-Do/instructional_code-search-net-python")

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments
tokenizer = GPT2Tokenizer.from_pretrained('sberbank-ai/rugpt3large_based_on_gpt2')
model = GPT2LMHeadModel.from_pretrained('sberbank-ai/rugpt3large_based_on_gpt2')

def prepare_data(data):
    input_ids = []
    attention_masks = []

    for text in data:
        encoded = tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=512,
            pad_to_max_length=True,
            return_attention_mask=True,
            return_tensors='pt'
        )

        input_ids.append(encoded['input_ids'])
        attention_masks.append(encoded['attention_mask'])

    return {
        'input_ids': torch.cat(input_ids, dim=0),
        'attention_mask': torch.cat(attention_masks, dim=0)
    }


training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=8000,
    per_device_train_batch_size=2,
    save_steps=2000,
    save_total_limit=2,
    prediction_loss_only=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset['train'],
    data_collator=prepare_data,
)

trainer.train()
model.save_pretrained('/content/drive/MyDrive/MyGPT')

On ChatGPT's advice, I tried adding from torch.optim import AdamW as TorchAdamW and optimizer = TorchAdamW(model.parameters(), lr=1e-3), but that didn't help. I also searched the internet and found nothing that resolves this error.
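
A possible cause, offered as a guess and not a confirmed diagnosis: by default Trainer drops every dataset column whose name is not a parameter of model.forward, so a dataset that contains only raw-text columns can end up with size 0 while the sampler still asks for indices from the original length. The sketch below approximates that filtering so it can be checked without running a training step. The helper name columns_kept_by_trainer, the stand-in fake_forward, and the raw column names are assumptions made for illustration; they are not part of the original code.

```python
import inspect


def columns_kept_by_trainer(forward_fn, column_names, label_names=()):
    """Approximate which columns survive Trainer's remove_unused_columns=True.

    Hypothetical helper: it mirrors the idea of the filtering (keep only
    columns named like forward() parameters, plus label columns), not the
    library's exact code.
    """
    accepted = set(inspect.signature(forward_fn).parameters)
    accepted |= {"label", "label_ids", *label_names}
    return [name for name in column_names if name in accepted]


# Stand-in for model.forward so the sketch runs without downloading a model.
def fake_forward(input_ids=None, attention_mask=None, labels=None):
    return None


# Assumed raw-text column names; check dataset['train'].column_names.
raw_columns = ["INSTRUCTION", "RESPONSE", "SOURCE"]
tokenized_columns = ["input_ids", "attention_mask", "labels"]

print(columns_kept_by_trainer(fake_forward, raw_columns))        # []
print(columns_kept_by_trainer(fake_forward, tokenized_columns))  # all three kept
```

In the Colab session the same check would be columns_kept_by_trainer(model.forward, dataset['train'].column_names). If it returns an empty list, two things are worth trying: tokenize the dataset first with dataset.map so it actually has input_ids and attention_mask columns, or pass remove_unused_columns=False to TrainingArguments and have the collator read the text field out of each row. Note that a data collator receives a list of row dicts, not a list of strings, and a causal LM also needs labels in the batch to compute a loss.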

Vovancho