I have a corpus that is 16 GB large and my ram IS around 16 GB ish. If I load the entire dataset to train the language model RoBERTa from scratch, I am going to have a memory issue. I intend to train my RoBERTa using the script provided from Huggingface's tutorial in their blog post: https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb
However, their blog post suggests the usage of LineByLineTextDatase. However, this loads the dataset eagerly.
class LineByLineTextDataset(Dataset):
"""
This will be superseded by a framework-agnostic approach
soon.
"""
def __init__(self, tokenizer: PreTrainedTokenizer, file_path: str, block_size: int):
assert os.path.isfile(file_path)
# Here, we do not cache the features, operating under the assumption
# that we will soon use fast multithreaded tokenizers from the
# `tokenizers` repo everywhere =)
logger.info("Creating features from dataset file at %s", file_path)
with open(file_path, encoding="utf-8") as f:
lines = [line for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())]
batch_encoding = tokenizer(lines, add_special_tokens=True, truncation=True, max_length=block_size)
self.examples = batch_encoding["input_ids"]
def __len__(self):
return len(self.examples)
def __getitem__(self, i) -> torch.Tensor:
return torch.tensor(self.examples[i], dtype=torch.long)
Unexpectedly, my kernel crashed on the part where they read the line. I wonder if there is a way to make it read lazily. It will be very desirable if the suggested answer can create minimum code change with the posted tutorial since I'm rather new with Huggingface and afraid I won't be able to debug it on my own.