The best approach is not to write your own dataset readers from scratch, as this tutorial does, but to use the Hugging Face datasets library, which is already integrated with Hugging Face transformers.
Here is a step-by-step guide on how to adapt the tutorial to the datasets library:
First, we have to turn the original CSV from the tutorial into something that can be loaded with the load_dataset function. We are going to preprocess the original train.csv and save the resulting splits as new_train.csv and validation.csv.
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the original tutorial CSV and expose the sentiment column under the name "label"
data = pd.read_csv("train.csv")
data["label"] = data["sentiment"]

# Hold out 20% of the rows for validation; index=False keeps the pandas index out of the CSVs
train, validation = train_test_split(data, test_size=0.2)
train.to_csv("new_train.csv", index=False)
validation.to_csv("validation.csv", index=False)
The datasets documentation provides an example of how to create your own cross-validation splits. Here we adapt it to our use case:
import datasets

val_ds = datasets.load_dataset("csv", data_files={"validation": "validation.csv"}, split=[f"validation[{k}%:{k+10}%]" for k in range(0, 100, 10)])
train_ds = datasets.load_dataset("csv", data_files={"train": "new_train.csv"}, split=[f"train[:{k}%]+train[{k+10}%:]" for k in range(0, 100, 10)])
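Because split is given as a list of ten slice expressions, each load_dataset call above returns a plain Python list of ten Dataset objects rather than a single DatasetDict. A quick sanity check might look like this (the row counts will of course depend on your data):

print(len(train_ds), len(val_ds))  # 10 10
for fold_idx, (tr, va) in enumerate(zip(train_ds, val_ds)):
    print(f"fold {fold_idx}: {len(tr)} training rows, {len(va)} validation rows")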
Now we tokenize the splits:
def preprocess_function(examples):
    # Tokenize the texts
    args = (examples["review"],)
    result = tokenizer(*args, padding=True, max_length=128, truncation=True)
    result["label"] = examples["label"]
    return result
for idx, item in enumerate(train_ds):
    train_ds[idx] = item.map(
        preprocess_function,
        batched=True,
        desc="Running tokenizer on dataset",
    )

for idx, item in enumerate(val_ds):
    val_ds[idx] = item.map(
        preprocess_function,
        batched=True,
        desc="Running tokenizer on dataset",
    )
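The tokenizer (and, later, the model) are the ones already defined in the tutorial. If you are starting from scratch, a minimal setup could look like the sketch below; distilbert-base-uncased and num_labels=2 are only example choices, assuming a binary sentiment task:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased"  # example checkpoint; substitute the one used in the tutorial
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)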
After that, you simply have to loop over the splits and pass them to your Trainer.
for train_dataset, val_dataset in zip(train_ds, val_ds):
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        compute_metrics=compute_metrics,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
    )
    ...
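What goes inside the loop is up to you. As a rough sketch, assuming your compute_metrics returns an accuracy value (so that evaluate() exposes it as eval_accuracy), you could train and evaluate each fold and average the results afterwards; note that re-creating the model per fold (or passing model_init to Trainer) would make every fold start from the same pretrained weights:

fold_metrics = []
for train_dataset, val_dataset in zip(train_ds, val_ds):
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        compute_metrics=compute_metrics,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
    )
    trainer.train()
    fold_metrics.append(trainer.evaluate())

# "eval_accuracy" is only an example key; use whatever your compute_metrics returns
mean_accuracy = sum(m["eval_accuracy"] for m in fold_metrics) / len(fold_metrics)
print(f"mean accuracy across {len(fold_metrics)} folds: {mean_accuracy:.4f}")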