Convert pandas dataframe to datasetDict

Question

I can't find any documentation on how to convert a pandas DataFrame to a datasets.dataset_dict.DatasetDict for use in a BERT workflow with a Hugging Face model. Take these simple dataframes, for example.

train_df = pd.DataFrame({
     "label" : [1, 2, 3],
     "text" : ["apple", "pear", "strawberry"]
})

test_df = pd.DataFrame({
     "label" : [2, 2, 1],
     "text" : ["banana", "pear", "apple"]
})

What is the most efficient way to convert these to the type above?

score 16 · Accepted Answer · answered Mar 25 '22 at 15:47

One possibility is to first create two Datasets and then combine them into a DatasetDict:

import datasets
import pandas as pd


train_df = pd.DataFrame({
     "label" : [1, 2, 3],
     "text" : ["apple", "pear", "strawberry"]
})

test_df = pd.DataFrame({
     "label" : [2, 2, 1],
     "text" : ["banana", "pear", "apple"]
})

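# Build one Dataset per split (Dataset lives in the datasets module imported above)
train_dataset = datasets.Dataset.from_dict(train_df)
test_dataset = datasets.Dataset.from_dict(test_df)
# Combine the splits into a single DatasetDict keyed by split name
my_dataset_dict = datasets.DatasetDict({"train": train_dataset, "test": test_dataset})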

The result is:

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 3
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 3
    })
})

Don't you have to do `Dataset.from_pandas(train_df)` and `Dataset.from_pandas(test_df)` instead of using the from_dict? — Vincent Claes, Jul 24 '22 at 09:10
@VincentClaes You can use `.from_dict()` as long as there are no missing values. — JoAnn Alvarez, Sep 20 '22 at 19:25
