10

How do I write a HuggingFace dataset to disk?

I have made my own HuggingFace dataset using a JSONL file:

Dataset({ features: ['id', 'text'], num_rows: 18 })

I would like to persist the dataset to disk.

Is there a preferred way to do this? Or, is the only option to use a general purpose library like joblib or pickle?

Campbell Hutcheson

2 Answers

15

You can save a HuggingFace dataset to disk using the save_to_disk() method.

For example:

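from datasets import load_dataset

# Load the JSON/JSONL file as a single Dataset (the "train" split)
test_dataset = load_dataset("json", data_files="test.json", split="train")

# Write the dataset to a directory named "test.hf"
test_dataset.save_to_disk("test.hf")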
Timbus Calin
Campbell Hutcheson
2

You can also export the dataset to a standard file format using one of the to_* methods, such as to_json(), to_csv(), or to_parquet(). See the following snippet as an example:

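from datasets import load_dataset

# load_dataset without a split returns a DatasetDict keyed by split name
dataset_dict = load_dataset("squad")

# Write each split to its own JSON Lines file
for split, split_dataset in dataset_dict.items():
    split_dataset.to_json(f"squad-{split}.jsonl")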

For more information, see the official Hugging Face notebook: https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/videos/save_load_dataset.ipynb#scrollTo=8PZbm6QOAtGO