How can I fix this?
I wrote code to train GPT-2 on a dataset from Hugging Face, but I get the following error and don't know why:
    ---------------------------------------------------------------------------
    IndexError                                Traceback (most recent call last)
    <ipython-input-7-b3178137a672> in <cell line: 17>()
         15 )
         16
    ---> 17 trainer.train()
         18 model.save_pretrained('/content/drive/MyDrive/MyGPT')

    11 frames
    /usr/local/lib/python3.10/dist-packages/datasets/formatting/formatting.py in _check_valid_index_key(key, size)
        524     if isinstance(key, int):
        525         if (key < 0 and key + size < 0) or (key >= size):
    --> 526             raise IndexError(f"Invalid key: {key} is out of bounds for size {size}")
        527         return
        528     elif isinstance(key, slice):

    IndexError: Invalid key: 409862 is out of bounds for size 0
This happens with Hugging Face Transformers in Google Colab. Here is the code:
    !pip install 'transformers[torch]'
    !pip install datasets

    from datasets import load_dataset

    dataset = load_dataset("Nan-Do/instructional_code-search-net-python")

    import torch
    from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments

    tokenizer = GPT2Tokenizer.from_pretrained('sberbank-ai/rugpt3large_based_on_gpt2')
    model = GPT2LMHeadModel.from_pretrained('sberbank-ai/rugpt3large_based_on_gpt2')

    def prepare_data(data):
        input_ids = []
        attention_masks = []
        for text in data:
            encoded = tokenizer.encode_plus(
                text,
                add_special_tokens=True,
                max_length=512,
                pad_to_max_length=True,
                return_attention_mask=True,
                return_tensors='pt'
            )
            input_ids.append(encoded['input_ids'])
            attention_masks.append(encoded['attention_mask'])
        return {
            'input_ids': torch.cat(input_ids, dim=0),
            'attention_mask': torch.cat(attention_masks, dim=0)
        }

    training_args = TrainingArguments(
        output_dir='./results',
        num_train_epochs=8000,
        per_device_train_batch_size=2,
        save_steps=2000,
        save_total_limit=2,
        prediction_loss_only=True,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset['train'],
        data_collator=prepare_data,
    )

    trainer.train()
    model.save_pretrained('/content/drive/MyDrive/MyGPT')
On ChatGPT's advice, I tried adding the following, but it didn't help:

    from torch.optim import AdamW as TorchAdamW
    optimizer = TorchAdamW(model.parameters(), lr=1e-3)

I also searched the internet but couldn't find a solution to this error.
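
One lead I have not been able to confirm: as far as I understand, `Trainer` by default (`remove_unused_columns=True` in `TrainingArguments`) drops dataset columns whose names are not arguments of the model's `forward` method, and I pass my dataset in untokenized. That might explain the "size 0" in the message. The sketch below is only a simplified illustration of that filtering idea, not the library's actual code. The `forward` stand-in, the `columns_kept` helper, and the raw column names are assumptions made up for this example.

```python
import inspect


def forward(input_ids=None, attention_mask=None, labels=None):
    """Hypothetical stand-in for model.forward (the real method takes more arguments)."""
    return None


def columns_kept(column_names, forward_fn):
    """Return the column names that match a parameter name of forward_fn."""
    accepted = set(inspect.signature(forward_fn).parameters)
    return [name for name in column_names if name in accepted]


# Assumed column names for an untokenized text dataset (illustrative only).
raw_columns = ["INSTRUCTION", "RESPONSE", "SOURCE"]
# Column names a tokenized dataset would typically carry.
tokenized_columns = ["input_ids", "attention_mask", "labels"]

kept_raw = columns_kept(raw_columns, forward)
kept_tokenized = columns_kept(tokenized_columns, forward)

print("raw columns kept:", kept_raw)
print("tokenized columns kept:", kept_tokenized)
```

If this really is the cause, tokenizing the dataset before passing it to `Trainer`, or setting `remove_unused_columns=False`, would be the things to try, but I have not verified either.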