
I can split my dataset into Train and Test split with 80%:20% ratio using:

from datasets import load_dataset
ds = load_dataset("myusername/mycorpus")
ds = ds["train"].train_test_split(test_size=0.2) # my dataset on HF has only a train split
print(ds)

which outputs:

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 62044
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 15512
    })
})

How can I generate the validation split, with ratio 80%:10%:10%?

alvas
Raptor

2 Answers

from datasets import load_dataset, DatasetDict
ds = load_dataset("myusername/mycorpus")

# Stage 1: 80% train, 20% held out for test + valid
train_testvalid = ds['train'].train_test_split(test_size=0.2)
# Stage 2: split the held-out 20% in half -> 10% test, 10% valid
test_valid = train_testvalid['test'].train_test_split(test_size=0.5)
# Gather them back into a single DatasetDict
ds = DatasetDict({
    'train': train_testvalid['train'],
    'test': test_valid['test'],
    'valid': test_valid['train']})
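To see why `test_size=0.5` on the held-out portion yields 10% each, the same two-stage split can be sketched in plain Python (a minimal sketch with no Hugging Face dependency; the row list here is made up):

```python
import random

# Hypothetical dataset of 1000 rows, shuffled with a fixed seed.
rows = list(range(1000))
random.Random(42).shuffle(rows)

# Stage 1: 80% train, 20% held out.
cut = int(len(rows) * 0.8)
train, heldout = rows[:cut], rows[cut:]

# Stage 2: split the held-out 20% in half -> 10% valid, 10% test.
mid = len(heldout) // 2
valid, test = heldout[:mid], heldout[mid:]

print(len(train), len(valid), len(test))  # 800 100 100
```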

That will output a dataset with the following structure:

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 62044
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 7756
    })
    valid: Dataset({
        features: ['translation'],
        num_rows: 7756
    })
})

Hope that helps you.

Ammar

TL;DR

from datasets import load_dataset
from datasets import DatasetDict

ds = load_dataset("alvations/xnli-15way")

ds_train_devtest = ds['train'].train_test_split(test_size=0.2, seed=42)
ds_devtest = ds_train_devtest['test'].train_test_split(test_size=0.5, seed=42)


ds_splits = DatasetDict({
    'train': ds_train_devtest['train'],
    'valid': ds_devtest['train'],
    'test': ds_devtest['test']
})

print("Before:\n", ds)
print("After\n", ds_splits)

[out]:

Before:

DatasetDict({
    train: Dataset({
        features: ['ar', 'bg', 'de', 'el', 'en', 'es', 'fr', 'hi', 'ru', 'sw', 'th', 'tr', 'ur', 'vi', 'zh'],
        num_rows: 20000
    })
})

After: 

DatasetDict({
    train: Dataset({
        features: ['ar', 'bg', 'de', 'el', 'en', 'es', 'fr', 'hi', 'ru', 'sw', 'th', 'tr', 'ur', 'vi', 'zh'],
        num_rows: 16000
    })
    valid: Dataset({
        features: ['ar', 'bg', 'de', 'el', 'en', 'es', 'fr', 'hi', 'ru', 'sw', 'th', 'tr', 'ur', 'vi', 'zh'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['ar', 'bg', 'de', 'el', 'en', 'es', 'fr', 'hi', 'ru', 'sw', 'th', 'tr', 'ur', 'vi', 'zh'],
        num_rows: 2000
    })
})

In Long

Using this dataset with only train split as an example:

from datasets import load_dataset

ds = load_dataset("alvations/xnli-15way")

[out]:

DatasetDict({
    train: Dataset({
        features: ['ar', 'bg', 'de', 'el', 'en', 'es', 'fr', 'hi', 'ru', 'sw', 'th', 'tr', 'ur', 'vi', 'zh'],
        num_rows: 20000
    })
})

Then you can first split the 20K rows of training data into 80-20% with:

from datasets import DatasetDict

ds_train_devtest = ds['train'].train_test_split(test_size=0.2, seed=42)

Then split the 4K held-out rows 50-50% into validation and test sets:

ds_devtest = ds_train_devtest['test'].train_test_split(test_size=0.5, seed=42)
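The `test_size=0.5` here is what turns the 20% hold-out into 10% validation and 10% test. A quick arithmetic check of the expected sizes (a sketch, using the 20,000-row count from the example above):

```python
# Expected split sizes for a 20,000-row dataset split 80%/10%/10%.
total = 20_000
train_n = int(total * 0.8)   # 80% -> 16000 rows
heldout = total - train_n    # 20% -> 4000 rows held out
valid_n = heldout // 2       # half of hold-out -> 2000 rows
test_n = heldout - valid_n   # remaining half -> 2000 rows

print(train_n, valid_n, test_n)  # 16000 2000 2000
```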

And finally put them together as a DatasetDict:

ds_splits = DatasetDict({
    'train': ds_train_devtest['train'],
    'valid': ds_devtest['train'],
    'test': ds_devtest['test']
})

[out]:

DatasetDict({
    train: Dataset({
        features: ['ar', 'bg', 'de', 'el', 'en', 'es', 'fr', 'hi', 'ru', 'sw', 'th', 'tr', 'ur', 'vi', 'zh'],
        num_rows: 16000
    })
    valid: Dataset({
        features: ['ar', 'bg', 'de', 'el', 'en', 'es', 'fr', 'hi', 'ru', 'sw', 'th', 'tr', 'ur', 'vi', 'zh'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['ar', 'bg', 'de', 'el', 'en', 'es', 'fr', 'hi', 'ru', 'sw', 'th', 'tr', 'ur', 'vi', 'zh'],
        num_rows: 2000
    })
})

Reference: https://huggingface.co/docs/datasets/v2.12.0/en/package_reference/main_classes#datasets.Dataset.train_test_split

alvas