
When feeding data to a machine learning model, there seem to be two approaches: loading the whole dataset at once, or using a data generator. The data generator is generally considered better because it saves memory. I am wondering about several things.

  1. If the dataset is not stored locally but has to be downloaded when used (like the example below), it seems to me that using a data generator would make no difference to memory usage, because the data still has to be downloaded and stored somewhere. In this case, what are the benefits of using a data generator?
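
For concreteness, my understanding of the memory argument is roughly the sketch below (the file path and parsing are placeholders): a generator keeps only one sample in RAM at a time, even though the downloaded file still occupies disk space.

def sample_generator(path):
    # 'path' is a placeholder for wherever the downloaded file ends up
    with open(path, "r", encoding="utf-8") as f:
        for line in f:           # reads lazily, one line at a time
            yield line.strip()   # only this sample is held in RAM

def load_all(path):
    # by contrast, this loads the entire dataset into RAM at once
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f]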

  2. A generator can only be iterated over once, as explained in What does the "yield" keyword do?. So after the data generator has been consumed, do I have to create a new one if I want to train further? To put it another way, suppose that before training I want to take a look at some training samples, so I write

next(train_iter)
next(train_iter)

Will the first two samples then be missing during training because I have already consumed them?
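
To make this concrete, here is a plain Python sketch (sample_gen is a stand-in for the real data generator, not the actual training code):

def sample_gen():
    # stand-in for a real data generator
    for i in range(5):
        yield f"sample-{i}"

train_iter = sample_gen()
print(next(train_iter))   # sample-0
print(next(train_iter))   # sample-1

# iterating later starts at sample-2; the first two items are not replayed
print(list(train_iter))   # ['sample-2', 'sample-3', 'sample-4']

# a fresh generator has to be created to see everything from the start again
train_iter = sample_gen()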

  3. If the whole dataset is loaded, I can view the tenth sample with
print(trainX[9])   # the tenth sample (index 9)

But when using a data generator, it can only be done by

for i in range(9):
    next(trainX_iter)

print(next(trainX_iter))

which seems a little weird. Is there a more elegant way?
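
The closest alternative I know of is itertools.islice, which hides the explicit loop but, as far as I understand, still consumes the nine preceding samples (trainX_iter below is a stand-in iterator):

import itertools

trainX_iter = iter(range(100))   # stand-in; replace with the real iterator

# islice(it, 9, None) skips the first nine items, so next() yields the tenth
tenth = next(itertools.islice(trainX_iter, 9, None))
print(tenth)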


And one example of PyTorch preferring a data generator: Migrate torchtext from the legacy API to the new API. The new version of the library loads data through an iterator, while the old version does not.

Old version

import torchtext
import torch
from torchtext.legacy import data
from torchtext.legacy import datasets

TEXT = data.Field()
LABEL = data.LabelField(dtype = torch.long)
legacy_train, legacy_test = datasets.IMDB.splits(TEXT, LABEL)  # datasets here refers to torchtext.legacy.datasets

legacy_examples = legacy_train.examples
print(legacy_examples[0].text, legacy_examples[0].label)

New version

from torchtext.datasets import IMDB
train_iter, test_iter = IMDB(split=('train', 'test'))
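
Continuing from the snippet above, peeking at one sample from the new-style iterator looks roughly like this. I am assuming each item is a (label, text) pair; the exact layout can differ between torchtext versions.

# assumes the snippet above has been run;
# assumed item layout: (label, text), which may vary by torchtext version
first_label, first_text = next(iter(train_iter))
print(first_label, first_text[:100])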
  • Using the combination of `Dataset`, `Sampler`, and `DataLoader` (https://pytorch.org/docs/stable/data.html) provides massive flexibility in how samples are generated during training/testing, far beyond the question of whether the dataset fits into RAM: it makes it possible to select and augment samples on the fly. It's not very useful in the naive case (the dataset fits into RAM and there is no problem-specific selection or augmentation), and can even hurt runtime performance (not model performance), but it provides a massive benefit for more complex data-loading scenarios. – KDecker Mar 21 '22 at 14:54
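
To illustrate the Dataset/Sampler/DataLoader combination mentioned in the comment above, here is a minimal map-style sketch with made-up tensors; it keeps random access by index while still producing batches lazily.

import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    # map-style dataset: supports len() and random access by index
    def __init__(self, n=100):
        self.x = torch.randn(n, 8)          # made-up features
        self.y = torch.randint(0, 2, (n,))  # made-up labels

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

ds = ToyDataset()
print(ds[9])   # direct view of the tenth sample

loader = DataLoader(ds, batch_size=16, shuffle=True)
for xb, yb in loader:   # batches are produced lazily
    pass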
