I am trying to load a training dataset in my Google Colab notebook but keep getting an error. This happens only in Colab: when I run the same notebook in VS Code, the dataset loads without any problem.
Here is the code snippet that raises the error:
from datasets import load_dataset
dataset_id = "nielsr/funsd-layoutlmv3"
dataset = load_dataset(dataset_id)
print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['test'])}")
In Colab, this returns:
Downloading and preparing dataset funsd-layoutlmv3/funsd to /root/.cache/huggingface/datasets/nielsr___funsd-layoutlmv3/funsd/1.0.0/0e3f4efdfd59aa1c3b4952c517894f7b1fc4d75c12ef01bcc8626a69e41c1bb9...
---------------------------------------------------------------------------
UnidentifiedImageError Traceback (most recent call last)
/usr/local/lib/python3.8/dist-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, split_info, check_duplicate_keys, job_id)
1569 _time = time.time()
-> 1570 for key, record in generator:
1571 if max_shard_size is not None and writer._num_bytes > max_shard_size:
9 frames
UnidentifiedImageError: cannot identify image file '/root/.cache/huggingface/datasets/downloads/extracted/e5bbbc543f8cc95554da124f3e80a57ed24d67d06ae1467da5810703f851e3f9/dataset/training_data/images/0000971160.png'
The above exception was the direct cause of the following exception:
DatasetGenerationError Traceback (most recent call last)
/usr/local/lib/python3.8/dist-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, split_info, check_duplicate_keys, job_id)
1604 if isinstance(e, SchemaInferenceError) and e.__context__ is not None:
1605 e = e.__context__
-> 1606 raise DatasetGenerationError("An error occurred while generating the dataset") from e
1607
1608 yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)
DatasetGenerationError: An error occurred while generating the dataset
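Since the first exception names a specific extracted file, one thing I can check is whether that file is really a PNG on the Colab disk, or whether the download or extraction left it missing, empty, or filled with something else. Below is a minimal sketch of such a check using only the standard library. The helper name `inspect_png` is my own and not part of `datasets` or Pillow. It only compares the first 8 bytes against the standard PNG signature, so a file truncated after the header would still pass.

```python
import os

# Every valid PNG file starts with these 8 bytes.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def inspect_png(path):
    """Return a short status string for the file at `path` (hypothetical helper)."""
    if not os.path.isfile(path):
        return "missing"
    if os.path.getsize(path) == 0:
        return "empty"
    with open(path, "rb") as f:
        header = f.read(len(PNG_SIGNATURE))
    if header != PNG_SIGNATURE:
        return "not a PNG"
    return "PNG signature ok"
```

In Colab I would call `inspect_png(...)` with the full path printed in the `UnidentifiedImageError` message above. A result other than "PNG signature ok" would suggest a broken cached download rather than a problem in my code.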
I was expecting the following output (which I successfully got in VS Code):
Train dataset size: 149
Test dataset size: 50
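Because the same notebook behaves differently in the two environments, it may also be useful to compare the installed package versions. The sketch below uses the standard-library `importlib.metadata`. The helper name `installed_versions` is my own, and the choice of `datasets` and `Pillow` as the packages to compare is an assumption based on the traceback, not a confirmed cause.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed to be the relevant packages; run in both Colab and VS Code and compare.
print(installed_versions(["datasets", "Pillow"]))
```

If the two environments report different versions, that difference would be the first thing I would try to eliminate.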
Thank you in advance!