When feeding data to a machine learning model, there seem to be two approaches: loading the whole dataset into memory, or using a data generator. And a data generator is usually considered better because it saves memory. I am wondering several things.
If the dataset is not stored locally but is downloaded when used (like the example below), it seems to me that a data generator should make no difference to memory usage, because the data has to be downloaded and stored somewhere anyway. In this case, what are the benefits of using a data generator?
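To make the scenario concrete, here is a minimal sketch of what I understand a data generator to do (a hypothetical file-based dataset, standing in for a real downloaded one): the downloaded file still occupies disk either way, but only one sample at a time is held in RAM.

```python
import os
import tempfile

def line_generator(path):
    # Yield one sample (line) at a time; the whole file never enters RAM at once.
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")

# Stand-in for a dataset that was downloaded to disk.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("sample 0\nsample 1\nsample 2\n")

gen = line_generator(tmp.name)
first = next(gen)  # only this one line is held in memory
print(first)
os.remove(tmp.name)
```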
A generator can only be used once, as explained in What does the "yield" keyword do?. So after a data generator is exhausted, do I have to create a new one if I want to train more? Or to put it another way: if I'd like to look at some training samples, I would write, before training,
next(train_iter)
next(train_iter)
During training, will those first two samples be missing because I have already consumed them?
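For example, with a plain Python iterator (which is what I assume these data iterators behave like), the peeked items really are consumed:

```python
# A plain Python iterator as a stand-in for train_iter.
train_iter = iter([("pos", "great film"), ("neg", "dull plot"), ("pos", "loved it")])

print(next(train_iter))  # peek at the first sample
print(next(train_iter))  # peek at the second sample

remaining = list(train_iter)
print(len(remaining))    # 1 -- the two samples peeked above are gone
```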
- If the whole dataset is loaded, I can view the tenth sample with
print(trainX[9])
But when using a data generator, it can only be done by
for i in range(9):
    next(trainX_iter)
print(next(trainX_iter))
which seems a little weird. Is there an elegant way?
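The most compact workaround I've found is itertools.islice, though it still consumes everything up to and including that sample (shown with a stand-in iterator):

```python
from itertools import islice

trainX_iter = iter(range(100))  # stand-in for a real data iterator

# Skip the first 9 items and take the next one: the tenth sample.
sample = next(islice(trainX_iter, 9, 10))
print(sample)  # 9
```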
And here is one example of a data generator being preferred in PyTorch: Migrate torchtext from the legacy API to the new API. The new version of the library loads data through an iterator, while the old version does not.
Old version
import torchtext
import torch
from torchtext.legacy import data
from torchtext.legacy import datasets
TEXT = data.Field()
LABEL = data.LabelField(dtype = torch.long)
legacy_train, legacy_test = datasets.IMDB.splits(TEXT, LABEL) # datasets here refers to torchtext.legacy.datasets
legacy_examples = legacy_train.examples
print(legacy_examples[0].text, legacy_examples[0].label)
New version
from torchtext.datasets import IMDB
train_iter, test_iter = IMDB(split=('train', 'test'))
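With the new API, the only way I've found to get random access or repeated passes is to materialize the iterator into a list, which brings back the memory cost I was trying to avoid (sketched with a stand-in iterator, since the real IMDB call downloads the data):

```python
# Stand-in for the IMDB train iterator (the real one downloads the dataset).
train_iter = iter([("neg", "text 0"), ("pos", "text 1"), ("neg", "text 2")])

train_list = list(train_iter)  # materializes every sample in memory
print(train_list[0])           # indexing works again, like the legacy API
print(len(train_list))
```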