I am working through Hugging Face's tutorial on fine-tuning BERT with a custom dataset, but I am unable to follow along because Google Colab has been running for over an hour (so far) just trying to load the data.
The tutorial uses the IMDB review dataset, which I downloaded to my Google Drive account (in order to mimic how I will be getting data for my actual project). The dataset contains 50,000 movie reviews, each saved as a standalone .txt file.
The code that I am trying to execute is the following (taken from the tutorial):
from pathlib import Path

def read_imdb_split(split_dir):
    # Each split directory contains a "pos" and a "neg" folder,
    # with one review per .txt file.
    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ["pos", "neg"]:
        for text_file in (split_dir/label_dir).iterdir():
            texts.append(text_file.read_text())
            labels.append(0 if label_dir == "neg" else 1)
    return texts, labels

train_texts, train_labels = read_imdb_split('/content/drive/MyDrive/aclImdb/train')
test_texts, test_labels = read_imdb_split('/content/drive/MyDrive/aclImdb/test')
The dataset is only 210 MB, so I do not understand how it could possibly take this long. Is it normal for it to take this long? What can I do?
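In case it is useful, here is a small timing snippet I plan to run to check whether the per-file reads over the Drive mount are the bottleneck rather than the total size (the folder path and the 100-file sample size are just placeholders I picked):

import time
from itertools import islice
from pathlib import Path

# Time how long it takes to read a small sample of reviews straight
# from the mounted Drive folder (the sample size of 100 is arbitrary).
sample_dir = Path('/content/drive/MyDrive/aclImdb/train/pos')
start = time.time()
sample = [f.read_text() for f in islice(sample_dir.iterdir(), 100)]
elapsed = time.time() - start

print(f"Read {len(sample)} files in {elapsed:.1f}s "
      f"({elapsed / len(sample):.2f}s per file)")

My assumption is that if even 100 files take several seconds, the problem is the 50,000 individual file reads over the Drive mount, not the 210 MB itself.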
I will also mention that I have Colab Pro and am using a GPU runtime with High-RAM enabled.