
I am following this tutorial from TowardsDataScience for text classification using the Hugging Face Trainer. To get a more robust model I want to do k-fold cross-validation, but I am not sure how to do this with the Trainer. Is there a built-in feature for it, or how can the cross-validation be done here?

Thanks in advance!

Maxl Gemeinderat

1 Answer


The best approach is not to write your own dataset readers from scratch, as this tutorial does, but to use the Hugging Face datasets library, which is already integrated with Hugging Face transformers.

Here is a step-by-step guide on how to adapt the tutorial to the datasets library:

First, we have to turn the original CSV from the tutorial into something that can be loaded with the load_dataset function. We preprocess the original train.csv file and save it as new_train.csv and validation.csv.

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the tutorial's CSV and copy the sentiment column into "label",
# the column name expected downstream.
data = pd.read_csv("train.csv")
data["label"] = data["sentiment"]
# Hold out 20% of the rows as a fixed validation file (index=False keeps the
# pandas index from being written as an extra column).
train, validation = train_test_split(data, test_size=0.2)
train.to_csv("new_train.csv", index=False)
validation.to_csv("validation.csv", index=False)
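
If the sentiment classes are imbalanced, a stratified split keeps the class ratio identical in both files. A minimal variant of the split above (the random_state is an arbitrary choice, only for reproducibility):

train, validation = train_test_split(
    data,
    test_size=0.2,
    stratify=data["label"],  # preserve the class distribution in both splits
    random_state=42,         # arbitrary seed for reproducibility
)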

The datasets documentation provides an example of how to create your own cross-validation splits; here we adapt it to our use case:

import datasets

# Ten folds: the k-th validation fold is the k-th 10% slice of validation.csv,
# and the k-th training fold is new_train.csv with that 10% window left out.
val_ds = datasets.load_dataset("csv", data_files={"validation": "validation.csv"}, split=[f"validation[{k}%:{k+10}%]" for k in range(0, 100, 10)])
train_ds = datasets.load_dataset("csv", data_files={"train": "new_train.csv"}, split=[f"train[:{k}%]+train[{k+10}%:]" for k in range(0, 100, 10)])
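
The remaining snippets assume that tokenizer, model, args and compute_metrics already exist, as they do in the tutorial. If you are starting from scratch, something roughly like the following works; the checkpoint name, output_dir and accuracy metric are placeholders, not necessarily what the tutorial uses:

import numpy as np
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

checkpoint = "bert-base-uncased"  # placeholder; use the checkpoint from the tutorial
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

args = TrainingArguments(
    output_dir="cv_output",           # placeholder directory for checkpoints
    evaluation_strategy="epoch",      # evaluate once per epoch
    save_strategy="epoch",            # must match evaluation_strategy for load_best_model_at_end
    load_best_model_at_end=True,      # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    num_train_epochs=3,
)

def compute_metrics(eval_pred):
    # Plain accuracy; the tutorial may compute a different metric.
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}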

Now we tokenize the splits:

def preprocess_function(examples):
    # Tokenize the review texts, truncating/padding to at most 128 tokens
    result = tokenizer(examples["review"], padding=True, max_length=128, truncation=True)
    # Carry the label column over into the tokenized output
    result["label"] = examples["label"]
    return result

# Tokenize every fold (train_ds and val_ds are lists of Dataset objects)
train_ds = [
    ds.map(preprocess_function, batched=True, desc="Running tokenizer on dataset")
    for ds in train_ds
]
val_ds = [
    ds.map(preprocess_function, batched=True, desc="Running tokenizer on dataset")
    for ds in val_ds
]
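
As a quick sanity check (purely illustrative), each fold should now expose the tokenizer output next to the original CSV columns:

for fold, (tr, va) in enumerate(zip(train_ds, val_ds)):
    # prints the fold index, row counts and column names (input_ids, attention_mask, label, ...)
    print(fold, tr.num_rows, va.num_rows, tr.column_names)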

After that, you simply loop over the fold pairs and pass each one to your Trainer; the ... at the end stands for the actual training and evaluation step, sketched right after the snippet.

for train_dataset, val_dataset in zip(train_ds, val_ds):
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        compute_metrics=compute_metrics,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
    )
...
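
The ... is where training and evaluation actually happen. One caveat: the model should be re-initialized at the start of every fold, otherwise each fold continues from the weights learned on the previous one. Here is a sketch of the complete loop, reusing the placeholder checkpoint, args and compute_metrics from above and averaging the per-fold evaluation metrics at the end:

from transformers import AutoModelForSequenceClassification, EarlyStoppingCallback, Trainer

fold_metrics = []
for train_dataset, val_dataset in zip(train_ds, val_ds):
    # Fresh pretrained weights for every fold, so folds do not leak into each other.
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        compute_metrics=compute_metrics,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
    )
    trainer.train()
    fold_metrics.append(trainer.evaluate())  # dict such as {"eval_loss": ..., "eval_accuracy": ...}

# Average the numeric evaluation metrics over the folds.
averaged = {key: sum(m[key] for m in fold_metrics) / len(fold_metrics)
            for key in fold_metrics[0]
            if isinstance(fold_metrics[0][key], (int, float))}
print(averaged)

Since every fold writes checkpoints to the same output_dir, you may want to give each fold its own output directory if you need to keep the individual models.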
Ruan