I'm trying to build a datasetDictionary object to train a QA model on PyTorch. I have these two different datasets:
test_dataset
Dataset({
features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
num_rows: 21489
})
and
train_dataset
Dataset({
features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
num_rows: 54159
})
In the dataset's documentation I didn't find anything. I'm quite a noob, thus the solution may be really easy. What I wish to obtain is something like this:
dataset
DatasetDict({
train: Dataset({
features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
num_rows: 54159
})
test: Dataset({
features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
num_rows: 21489
})
})
I really don't find how to use two datasets to create a dataserDict or how to set the keys. Moreover, I wish to "cut" the train set in two: train and validation sets, but also this passage is hard for me to handle. The final result should be something like this:
dataset
DatasetDict({
train: Dataset({
features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
num_rows: 54159 - x
})
validation: Dataset({
features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
num_rows: x
})
test: Dataset({
features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
num_rows: 21489
})
})
Thank you in advance and pardon me for being a noob :)