I have a corpus of about 16 GB, and my machine has roughly 16 GB of RAM. If I load the entire dataset to train the RoBERTa language model from scratch, I am going to run into a memory issue. I intend to train RoBERTa using the script from Huggingface's tutorial notebook: https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb

However, the tutorial suggests using LineByLineTextDataset, which loads the whole dataset eagerly:

import os
import logging

import torch
from torch.utils.data import Dataset
from transformers import PreTrainedTokenizer

logger = logging.getLogger(__name__)


class LineByLineTextDataset(Dataset):
    """
    This will be superseded by a framework-agnostic approach
    soon.
    """

    def __init__(self, tokenizer: PreTrainedTokenizer, file_path: str, block_size: int):
        assert os.path.isfile(file_path)
        # Here, we do not cache the features, operating under the assumption
        # that we will soon use fast multithreaded tokenizers from the
        # `tokenizers` repo everywhere =)
        logger.info("Creating features from dataset file at %s", file_path)

        with open(file_path, encoding="utf-8") as f:
            # The entire file is read into memory here, which is what exhausts the RAM.
            lines = [line for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())]

        batch_encoding = tokenizer(lines, add_special_tokens=True, truncation=True, max_length=block_size)
        self.examples = batch_encoding["input_ids"]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i) -> torch.Tensor:
        return torch.tensor(self.examples[i], dtype=torch.long)

Unsurprisingly, my kernel crashed at the part where the lines are read into memory. I wonder if there is a way to make it read the file lazily. Ideally, the suggested answer would require minimal changes to the posted tutorial, since I'm rather new to Huggingface and afraid I won't be able to debug it on my own.
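For illustration only, here is a rough sketch (not from the tutorial) of what I mean by reading lazily, using a torch IterableDataset; the class name is made up, and it yields one tokenized line at a time instead of materialising the whole corpus:

import torch
from torch.utils.data import IterableDataset


class LazyLineByLineTextDataset(IterableDataset):  # hypothetical name, for illustration
    def __init__(self, tokenizer, file_path: str, block_size: int):
        self.tokenizer = tokenizer
        self.file_path = file_path
        self.block_size = block_size

    def __iter__(self):
        with open(self.file_path, encoding="utf-8") as f:
            for line in f:  # the file object iterates lazily, one line at a time
                line = line.strip()
                if line:
                    ids = self.tokenizer(
                        line, add_special_tokens=True, truncation=True, max_length=self.block_size
                    )["input_ids"]
                    yield torch.tensor(ids, dtype=torch.long)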

1 Answer

I would recommend using HuggingFace's own datasets library. The documentation says:

It provides a very efficient way to load and process data from raw files (CSV/JSON/text) or in-memory data (python dict, pandas dataframe) with a special focus on memory efficiency and speed. As a matter of example, loading a 18GB dataset like English Wikipedia allocate 9 MB in RAM and you can iterate over the dataset at 1-2 GBit/s in python.

The quick tour has good explanations and code snippets for creating a dataset object from your own data, and it also explains how to train your own model.
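As a minimal sketch of how that could look for your case, assuming a plain-text corpus file (here called corpus.txt), a locally saved tokenizer, and a block size of 512 (all placeholders, not taken from the tutorial):

from datasets import load_dataset
from transformers import RobertaTokenizerFast

# Assumed local path to your trained tokenizer.
tokenizer = RobertaTokenizerFast.from_pretrained("./my-roberta-tokenizer")

# load_dataset("text", ...) memory-maps the file instead of reading it all into RAM.
raw_dataset = load_dataset("text", data_files={"train": "corpus.txt"})["train"]

# Drop empty or whitespace-only lines, mirroring what LineByLineTextDataset does.
raw_dataset = raw_dataset.filter(lambda example: example["text"].strip() != "")

def tokenize(batch):
    return tokenizer(batch["text"], add_special_tokens=True, truncation=True, max_length=512)

# map() processes the corpus in batches and caches the result on disk,
# so the tokenized dataset does not have to fit in memory either.
tokenized_dataset = raw_dataset.map(tokenize, batched=True, remove_columns=["text"])

The resulting tokenized_dataset should be usable as the train_dataset for the Trainer, together with DataCollatorForLanguageModeling, in place of LineByLineTextDataset from the tutorial.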
