
I can split my dataset into Train and Test split with 80%:20% ratio using:

from datasets import load_dataset
ds = load_dataset("myusername/mycorpus")
ds = ds["train"].train_test_split(test_size=0.2) # my dataset on HF has only a train split
print(ds)

which outputs:

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 62044
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 15512
    })
})

How can I generate the validation split, with ratio 80%:10%:10%?

alvas
Raptor

2 Answers

from datasets import load_dataset, DatasetDict
ds = load_dataset("myusername/mycorpus")

# Stage 1: 80% train, 20% held out for test + valid
train_testvalid = ds['train'].train_test_split(test_size=0.2)
# Stage 2: split the held-out 20% in half -> 10% test, 10% valid
test_valid = train_testvalid['test'].train_test_split(test_size=0.5)
# Gather them back into a single DatasetDict
ds = DatasetDict({
    'train': train_testvalid['train'],
    'test': test_valid['test'],
    'valid': test_valid['train']})
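To see why `test_size=0.5` on the held-out portion yields 10% each, the same two-stage split can be sketched in plain Python (a minimal sketch with no Hugging Face dependency; the row list here is made up):

```python
import random

# Hypothetical dataset of 1000 rows, shuffled with a fixed seed.
rows = list(range(1000))
random.Random(42).shuffle(rows)

# Stage 1: 80% train, 20% held out.
cut = int(len(rows) * 0.8)
train, heldout = rows[:cut], rows[cut:]

# Stage 2: split the held-out 20% in half -> 10% valid, 10% test.
mid = len(heldout) // 2
valid, test = heldout[:mid], heldout[mid:]

print(len(train), len(valid), len(test))  # 800 100 100
```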

That will output a dataset with the following structure:

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 62044
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 7756
    })
    valid: Dataset({
        features: ['translation'],
        num_rows: 7756
    })
})

Hope that helps you.

Ammar

TL;DR

from datasets import load_dataset
from datasets import DatasetDict

ds = load_dataset("alvations/xnli-15way")

ds_train_devtest = ds['train'].train_test_split(test_size=0.2, seed=42)
ds_devtest = ds_train_devtest['test'].train_test_split(test_size=0.5, seed=42)


ds_splits = DatasetDict({
    'train': ds_train_devtest['train'],
    'valid': ds_devtest['train'],
    'test': ds_devtest['test']
})

print("Before:\n", ds)
print("After\n", ds_splits)

[out]:

Before:

DatasetDict({
    train: Dataset({
        features: ['ar', 'bg', 'de', 'el', 'en', 'es', 'fr', 'hi', 'ru', 'sw', 'th', 'tr', 'ur', 'vi', 'zh'],
        num_rows: 20000
    })
})

After: 

DatasetDict({
    train: Dataset({
        features: ['ar', 'bg', 'de', 'el', 'en', 'es', 'fr', 'hi', 'ru', 'sw', 'th', 'tr', 'ur', 'vi', 'zh'],
        num_rows: 16000
    })
    valid: Dataset({
        features: ['ar', 'bg', 'de', 'el', 'en', 'es', 'fr', 'hi', 'ru', 'sw', 'th', 'tr', 'ur', 'vi', 'zh'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['ar', 'bg', 'de', 'el', 'en', 'es', 'fr', 'hi', 'ru', 'sw', 'th', 'tr', 'ur', 'vi', 'zh'],
        num_rows: 2000
    })
})

In Long

Using this dataset with only train split as an example:

from datasets import load_dataset

ds = load_dataset("alvations/xnli-15way")

[out]:

DatasetDict({
    train: Dataset({
        features: ['ar', 'bg', 'de', 'el', 'en', 'es', 'fr', 'hi', 'ru', 'sw', 'th', 'tr', 'ur', 'vi', 'zh'],
        num_rows: 20000
    })
})

Then you can first split the 20K rows of training data into 80-20% with:

from datasets import DatasetDict

ds_train_devtest = ds['train'].train_test_split(test_size=0.2, seed=42)

Then split the 4K held-out rows 50-50% into validation and test sets:

ds_devtest = ds_train_devtest['test'].train_test_split(test_size=0.5, seed=42)
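The `test_size=0.5` here is what turns the 20% hold-out into 10% validation and 10% test. A quick arithmetic check of the expected sizes (a sketch, using the 20,000-row count from the example above):

```python
# Expected split sizes for a 20,000-row dataset split 80%/10%/10%.
total = 20_000
train_n = int(total * 0.8)   # 80% -> 16000 rows
heldout = total - train_n    # 20% -> 4000 rows held out
valid_n = heldout // 2       # half of hold-out -> 2000 rows
test_n = heldout - valid_n   # remaining half -> 2000 rows

print(train_n, valid_n, test_n)  # 16000 2000 2000
```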

And finally put them together as a DatasetDict:

ds_splits = DatasetDict({
    'train': ds_train_devtest['train'],
    'valid': ds_devtest['train'],
    'test': ds_devtest['test']
})

[out]:

DatasetDict({
    train: Dataset({
        features: ['ar', 'bg', 'de', 'el', 'en', 'es', 'fr', 'hi', 'ru', 'sw', 'th', 'tr', 'ur', 'vi', 'zh'],
        num_rows: 16000
    })
    valid: Dataset({
        features: ['ar', 'bg', 'de', 'el', 'en', 'es', 'fr', 'hi', 'ru', 'sw', 'th', 'tr', 'ur', 'vi', 'zh'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['ar', 'bg', 'de', 'el', 'en', 'es', 'fr', 'hi', 'ru', 'sw', 'th', 'tr', 'ur', 'vi', 'zh'],
        num_rows: 2000
    })
})

Reference: https://huggingface.co/docs/datasets/v2.12.0/en/package_reference/main_classes#datasets.Dataset.train_test_split

alvas