I have a dataset stored as a Python dictionary, which I converted to a Dataset:

import datasets

ds = datasets.Dataset.from_dict(bio_dict)

Printing it now shows:

Dataset({
    features: ['id', 'text', 'ner_tags', 'input_ids', 'attention_mask', 'label'],
    num_rows: 8805
})

When I call train_test_split on the Dataset, I get the following error:

train_testvalid = ds.train_test_split(test_size=0.5, shuffle=True, stratify_by_column="label")

ValueError: Stratifying by column is only supported for ClassLabel column, and column label is Sequence.

How can I change the type of the label column to ClassLabel so that stratification works?

Jason Aller
Yana

1 Answer

Use class_encode_column to cast the column to ClassLabel before splitting:

ds = ds.class_encode_column("label")