I have a question regarding "on-the-fly" tokenization. It was prompted by reading "How to train a new language model from scratch using Transformers and Tokenizers" here. Towards the end there is this sentence: "If your dataset is very large, you can opt to load and tokenize examples on the fly, rather than as a preprocessing step". I've tried to come up with a solution that combines both datasets and tokenizers, but did not manage to find a good pattern.
I guess the solution would entail wrapping a dataset in a PyTorch dataset.
As a concrete example, here is the dataset class from the docs:
import torch

class SquadDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        # instead of doing this beforehand, I'd like to do tokenization on the fly
        self.encodings = encodings

    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)

train_dataset = SquadDataset(train_encodings)
How would one implement this with "on-the-fly" tokenization exploiting the vectorized capabilities of tokenizers?
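To make the question more concrete, here is a rough sketch of one pattern that might work, not a confirmed answer: keep only the raw text in the dataset and tokenize a whole batch at once inside a collate function, so the tokenizer sees a list of examples rather than one example at a time. The names `RawSquadDataset`, `make_collate_fn`, and `toy_tokenize_batch` are hypothetical, and `toy_tokenize_batch` is a stand-in for a real tokenizer so the sketch runs without any downloads.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Batch = Dict[str, List[List[int]]]


class RawSquadDataset:
    """Map-style dataset that stores raw (question, context) pairs, not encodings."""

    def __init__(self, questions: Sequence[str], contexts: Sequence[str]):
        assert len(questions) == len(contexts)
        self.questions = list(questions)
        self.contexts = list(contexts)

    def __getitem__(self, idx: int) -> Tuple[str, str]:
        # No tokenization here: just hand back the raw strings.
        return self.questions[idx], self.contexts[idx]

    def __len__(self) -> int:
        return len(self.questions)


def make_collate_fn(tokenize_batch: Callable[[List[str], List[str]], Batch]):
    """Build a collate function that tokenizes each batch in a single call."""

    def collate(batch: List[Tuple[str, str]]) -> Batch:
        questions = [q for q, _ in batch]
        contexts = [c for _, c in batch]
        return tokenize_batch(questions, contexts)

    return collate


def toy_tokenize_batch(questions: List[str], contexts: List[str]) -> Batch:
    """Stand-in tokenizer (assumption): whitespace split, token length as id, 0-padded."""
    rows = [(q + " " + c).split() for q, c in zip(questions, contexts)]
    width = max(len(row) for row in rows)
    return {
        "input_ids": [[len(t) for t in row] + [0] * (width - len(row)) for row in rows],
        "attention_mask": [[1] * len(row) + [0] * (width - len(row)) for row in rows],
    }


dataset = RawSquadDataset(
    ["Who wrote it?", "Where?"],
    ["Ada wrote it.", "In London, probably."],
)
collate = make_collate_fn(toy_tokenize_batch)
batch = collate([dataset[i] for i in range(len(dataset))])
```

In a real setup, I assume the stand-in would be replaced by something like `lambda q, c: tokenizer(q, c, truncation=True, padding=True, return_tensors="pt")` and the result passed as `collate_fn` to `torch.utils.data.DataLoader`. I have not verified that this is the recommended approach, or how it interacts with multiple workers.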