
I am going through Hugging Face's tutorial on fine-tuning BERT with a custom dataset, but I am unable to follow along because Google Colab has been executing for over an hour (so far) just trying to load the data.

The tutorial uses the IMDB review dataset, which I downloaded to my Google Drive account (in order to mimic how I will be getting data for my actual project). The dataset contains 50,000 movie reviews, each saved as a standalone .txt file.
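
For reference, the extracted archive keeps the standard aclImdb layout, with pos and neg subfolders under each split. A minimal sanity check of the structure (the path is simply where I placed the folder on my Drive) would be:

from pathlib import Path

root = Path('/content/drive/MyDrive/aclImdb')  # where the extracted archive lives

# List the split folders, then the label folders inside train/
print(sorted(p.name for p in root.iterdir()))              # 'train' and 'test' should appear
print(sorted(p.name for p in (root / 'train').iterdir()))  # 'pos' and 'neg' should appear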

The code that I am trying to execute is the following (taken from the tutorial):

from pathlib import Path

def read_imdb_split(split_dir):
    # Read every review file under the pos/ and neg/ subfolders of a split
    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ["pos", "neg"]:
        for text_file in (split_dir/label_dir).iterdir():
            texts.append(text_file.read_text())
            # == compares string values; the tutorial's original "is" only checks identity
            labels.append(0 if label_dir == "neg" else 1)

    return texts, labels

train_texts, train_labels = read_imdb_split('/content/drive/MyDrive/aclImdb/train')
test_texts, test_labels = read_imdb_split('/content/drive/MyDrive/aclImdb/test')

The whole dataset is only 210 MB, albeit split across 50,000 small files, so I do not understand how loading it could possibly take this long. Is it normal for it to take this long? What can I do?
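
For completeness, I am reading the files through the standard Drive mount (the usual google.colab setup cell, shown here for context):

from google.colab import drive

# Mount Google Drive at /content/drive so the aclImdb folder is reachable
drive.mount('/content/drive')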

I will also mention that I have Colab Pro and am using a GPU with High-RAM.

  • Does this answer your question? [Google Colab is very slow compared to my PC](https://stackoverflow.com/questions/49360888/google-colab-is-very-slow-compared-to-my-pc) – Nikko J. Mar 30 '21 at 21:50
  • Not fully. I know other people are having similar problems, but the fact that after 1.5 hours I am still unable to load 210 MB of text data makes me feel like I am doing something wrong – Luca Guarro Mar 30 '21 at 21:59

0 Answers