13

I'm trying to build a DatasetDict object to train a QA model with PyTorch. I have these two datasets:

test_dataset

Dataset({
    features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
    num_rows: 21489
})

and

train_dataset

Dataset({
    features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
    num_rows: 54159
})

I didn't find anything in the datasets documentation. I'm quite a noob, so the solution may be really easy. What I want to obtain is something like this:

dataset

DatasetDict({
    train: Dataset({
        features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
        num_rows: 54159
    })
    test: Dataset({
        features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
        num_rows: 21489
    })
})

I really can't figure out how to combine two datasets into a DatasetDict or how to set the keys. Moreover, I want to "cut" the train set in two, a train set and a validation set, but this step is also hard for me to handle. The final result should be something like this:

dataset

DatasetDict({
    train: Dataset({
        features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
        num_rows: 54159 - x
    })
    validation: Dataset({
        features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
        num_rows: x
    })
    test: Dataset({
        features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
        num_rows: 21489
    })
})

Thank you in advance and pardon me for being a noob :)

Ondiek Elijah
Peppe95

3 Answers

16

To get the validation dataset, you can do this:

train_dataset, validation_dataset = train_dataset.train_test_split(test_size=0.1).values()

This call shuffles the train dataset and moves 10% of its rows into the validation dataset; the remaining 90% stays in the train dataset.

To obtain a "DatasetDict", you can do this:

import datasets
dd = datasets.DatasetDict({"train": train_dataset, "test": test_dataset})
Dharman
Lin
9

For future generations ;) Adding a bit more information about the answer.

from datasets import Dataset, DatasetDict

# x_* and y_* are lists of texts and labels that you have already prepared.
d = {
    'train': Dataset.from_dict({'label': y_train, 'text': x_train}),
    'val': Dataset.from_dict({'label': y_val, 'text': x_val}),
    'test': Dataset.from_dict({'label': y_test, 'text': x_test}),
}

dataset = DatasetDict(d)
Sahar Millis
3

I resolved a similar issue by creating a DatasetDict directly from a CSV file. As the documentation states, you only need to load the file like this:

from datasets import DatasetDict, load_dataset
dataset = load_dataset('csv', data_files='my_file.csv')

If you need to load multiple CSV files, that is possible too.

After that, as suggested by @Lin, an easy way to split the data into training and validation sets is the following (when a single file is loaded without split names, all rows end up in the 'train' split):

train_dataset, validation_dataset = dataset['train'].train_test_split(test_size=0.1).values()

Finally, set the DatasetDict like this:

dataset = DatasetDict({'train': train_dataset, 'val': validation_dataset})