How can I handle these datasets to create a DatasetDict?

Question

I'm trying to build a DatasetDict object to train a QA model with PyTorch. I have these two datasets:

test_dataset

Dataset({
    features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
    num_rows: 21489
})

and

train_dataset

Dataset({
    features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
    num_rows: 54159
})

I didn't find anything in the datasets documentation. I'm quite new to this, so the solution may be really easy. What I want to obtain is something like this:

dataset

DatasetDict({
    train: Dataset({
        features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
        num_rows: 54159
    })
    test: Dataset({
        features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
        num_rows: 21489
    })
})

I can't find how to use two datasets to create a DatasetDict or how to set the keys. Moreover, I want to "cut" the train set in two, a train set and a validation set, but this step is also hard for me to handle. The final result should be something like this:

dataset

DatasetDict({
    train: Dataset({
        features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
        num_rows: 54159 - x
    })
    validation: Dataset({
        features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
        num_rows: x
    })
    test: Dataset({
        features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
        num_rows: 21489
    })
})

Thank you in advance and pardon me for being a noob :)

score 16 · Answer 1 · edited Jun 30 '21 at 17:33


To get the validation dataset, you can do this:

train_dataset, validation_dataset = train_dataset.train_test_split(test_size=0.1).values()

train_test_split returns a DatasetDict with the keys "train" and "test", so this moves 10% of the train dataset into the validation dataset and keeps the remaining 90% for training.

To obtain the DatasetDict, you can do this:

import datasets
dd = datasets.DatasetDict({"train": train_dataset, "validation": validation_dataset, "test": test_dataset})

answered Jun 30 '21 at 17:28 by Lin · edited Jun 30 '21 at 17:33 by Dharman

score 9 · Answer 2 · answered May 23 '22 at 14:02

For future generations ;) Adding a bit more information about the answer.

from datasets import Dataset, DatasetDict

# x_* are assumed to be lists of texts and y_* the matching lists of labels
d = {'train': Dataset.from_dict({'label': y_train, 'text': x_train}),
     'val': Dataset.from_dict({'label': y_val, 'text': x_val}),
     'test': Dataset.from_dict({'label': y_test, 'text': x_test})
     }

dataset = DatasetDict(d)

score 3 · Answer 3 · answered Jul 13 '22 at 15:45

I resolved a similar issue while creating a DatasetDict by loading data directly from a CSV file. As the documentation states, you just need to load the file like this:

from datasets import load_dataset
dataset = load_dataset('csv', data_files='my_file.csv')

Loading multiple CSV files is possible too.

After that, as suggested by @Lin, an easy way to split the data into training and validation sets is the following:

train_dataset, validation_dataset = dataset['train'].train_test_split(test_size=0.1).values()

Finally, set the DatasetDict like this:

from datasets import DatasetDict
dataset = DatasetDict({'train': train_dataset, 'val': validation_dataset})
