I am trying to load a training dataset in my Google Colab notebook but keep getting an error. This happens only in Colab: the same notebook loads the dataset without any problem when I run it in VS Code.

Here is the code snippet which returns the error:

from datasets import load_dataset

dataset_id = "nielsr/funsd-layoutlmv3"

dataset = load_dataset(dataset_id)

print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['test'])}")

In Colab, this returns:

Downloading and preparing dataset funsd-layoutlmv3/funsd to /root/.cache/huggingface/datasets/nielsr___funsd-layoutlmv3/funsd/1.0.0/0e3f4efdfd59aa1c3b4952c517894f7b1fc4d75c12ef01bcc8626a69e41c1bb9...
---------------------------------------------------------------------------
UnidentifiedImageError                    Traceback (most recent call last)
/usr/local/lib/python3.8/dist-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, split_info, check_duplicate_keys, job_id)
   1569                 _time = time.time()
-> 1570                 for key, record in generator:
   1571                     if max_shard_size is not None and writer._num_bytes > max_shard_size:

9 frames
UnidentifiedImageError: cannot identify image file '/root/.cache/huggingface/datasets/downloads/extracted/e5bbbc543f8cc95554da124f3e80a57ed24d67d06ae1467da5810703f851e3f9/dataset/training_data/images/0000971160.png'

The above exception was the direct cause of the following exception:

DatasetGenerationError                    Traceback (most recent call last)
/usr/local/lib/python3.8/dist-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, split_info, check_duplicate_keys, job_id)
   1604             if isinstance(e, SchemaInferenceError) and e.__context__ is not None:
   1605                 e = e.__context__
-> 1606             raise DatasetGenerationError("An error occurred while generating the dataset") from e
   1607 
   1608         yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)

DatasetGenerationError: An error occurred while generating the dataset

I was expecting to get the following (which I successfully got in VS Code):

Train dataset size: 149
Test dataset size: 50

Thank you in advance!

1 Answer

Most likely one of the files got corrupted while downloading, and Colab keeps reusing the broken cached copy. Try forcing a fresh download:

from datasets import load_dataset


dataset = load_dataset("nielsr/funsd-layoutlmv3", download_mode="force_redownload")

print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['test'])}")

On Colab, this should then print the train and test dataset sizes instead of raising the error.
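If you want to check whether the cached files really are damaged before downloading again, here is a rough sketch that uses only the standard library. The helper name `find_bad_pngs` is made up for illustration, and the directory is the one shown in the traceback in the question, so it may differ on your machine. It only checks that each `.png` file starts with the 8-byte PNG signature. That catches empty files and files that are really an HTML error page, but a file that was cut off partway through would still pass.

```python
from pathlib import Path

# The first 8 bytes of every valid PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def find_bad_pngs(root):
    """Return .png files under `root` that are empty or lack the PNG signature.

    Hypothetical helper for illustration; not part of the `datasets` library.
    """
    bad = []
    for path in sorted(Path(root).rglob("*.png")):
        with open(path, "rb") as fh:
            header = fh.read(len(PNG_SIGNATURE))
        if header != PNG_SIGNATURE:
            bad.append(path)
    return bad


if __name__ == "__main__":
    # Assumption: cache location copied from the traceback in the question.
    cache_dir = "/root/.cache/huggingface/datasets/downloads/extracted"
    for path in find_bad_pngs(cache_dir):
        print(f"Suspicious file: {path}")
```

If this lists the file named in the `UnidentifiedImageError`, that supports the corrupted-download explanation, and `download_mode="force_redownload"` is the fix to try.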

alvas