
I have an MNIST-like dataset that does not fit in memory (process memory, not GPU memory). My dataset is 4 GB.

This is not a TFLearn issue.

As far as I know, model.fit requires arrays for x and y.

TFLearn example:

model.fit(x, y, n_epoch=10, validation_set=(val_x, val_y))

I was wondering if there's a way to pass a "batch iterator" instead of an array. Basically, for each batch I would load the necessary data from disk.

This way I would not run into process memory overflow errors.

EDIT: np.memmap could be an option, but I don't see how to skip the first few bytes that make up the header.
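For reference, np.memmap accepts an offset argument given in bytes, so the header can be skipped without reading the whole file. A minimal sketch, assuming the standard MNIST IDX layout (16-byte header in the image file, 8-byte header in the label file); the file names and record count are placeholders:

import numpy as np

# Memory-map the raw IDX files, skipping their headers via the offset argument.
# Assumes the standard MNIST layout: 16-byte image header, 8-byte label header.
n_images = 60000  # placeholder record count
images = np.memmap('train-images-idx3-ubyte', dtype=np.uint8, mode='r',
                   offset=16, shape=(n_images, 28, 28))
labels = np.memmap('train-labels-idx1-ubyte', dtype=np.uint8, mode='r',
                   offset=8, shape=(n_images,))

# Slicing reads only the requested rows from disk, so one batch at a time
# can be materialised without loading the full 4 GB into process memory.
batch_x = images[0:128].astype(np.float32) / 255.0
batch_y = labels[0:128]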

  • You probably need to use a queue where you set the batch size and capacity appropriately. `tf.train.shuffle_batch` should work here. – brown.2179 Oct 09 '17 at 17:17
  • Use the offset argument of numpy.memmap, which takes the number of bytes to skip from the beginning of the file. numpy.float32 == 4 bytes, float64 == 8 bytes, etc. – JYun Apr 10 '18 at 01:37

1 Answer


You can use the Dataset API.

"The Dataset API supports a variety of file formats so that you can process large datasets that do not fit in memory"

Basically the input pipeline would become part of your graph.
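For an MNIST-like binary file, a minimal sketch in the TF 1.x style of that era could use tf.data.FixedLengthRecordDataset, whose header_bytes argument also addresses the header-skipping problem from the question. The file names and the 16/8-byte headers are assumptions based on the standard MNIST IDX layout:

import tensorflow as tf

# Each MNIST image record is 28*28 = 784 bytes; header_bytes skips the
# IDX header (assumption: standard MNIST IDX files).
image_bytes = 28 * 28
images_ds = tf.data.FixedLengthRecordDataset(
    'train-images-idx3-ubyte', record_bytes=image_bytes, header_bytes=16)
labels_ds = tf.data.FixedLengthRecordDataset(
    'train-labels-idx1-ubyte', record_bytes=1, header_bytes=8)

def decode(image_raw, label_raw):
    # Convert the raw byte strings into tensors.
    image = tf.cast(tf.decode_raw(image_raw, tf.uint8), tf.float32) / 255.0
    image = tf.reshape(image, [28, 28])
    label = tf.cast(tf.decode_raw(label_raw, tf.uint8), tf.int64)[0]
    return image, label

# Reading, decoding, shuffling and batching are all graph ops, so only one
# batch at a time is pulled from disk.
dataset = (tf.data.Dataset.zip((images_ds, labels_ds))
           .map(decode)
           .shuffle(buffer_size=10000)
           .batch(128)
           .repeat())

iterator = dataset.make_one_shot_iterator()
next_images, next_labels = iterator.get_next()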

If memory is still an issue, you can use a generator to create your tf.data.Dataset. You could also speed up the pipeline by preparing TFRecords and building your Dataset from those.
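A sketch of the generator route, again in TF 1.x style; the generator below is a hypothetical loader that reads one slice from disk per batch, here via the memory-mapped IDX files from the question:

import numpy as np
import tensorflow as tf

def batch_generator(batch_size=128):
    # Hypothetical loader: yields (images, labels) slices read lazily from
    # disk through np.memmap, as shown in the question.
    images = np.memmap('train-images-idx3-ubyte', dtype=np.uint8, mode='r',
                       offset=16, shape=(60000, 28, 28))
    labels = np.memmap('train-labels-idx1-ubyte', dtype=np.uint8, mode='r',
                       offset=8, shape=(60000,))
    for start in range(0, len(labels), batch_size):
        x = images[start:start + batch_size].astype(np.float32) / 255.0
        y = labels[start:start + batch_size].astype(np.int64)
        yield x, y

dataset = tf.data.Dataset.from_generator(
    batch_generator,
    output_types=(tf.float32, tf.int64),
    output_shapes=(tf.TensorShape([None, 28, 28]), tf.TensorShape([None])))

iterator = dataset.make_one_shot_iterator()
next_x, next_y = iterator.get_next()

Either way, the full dataset never has to be resident in memory; only the current batch is materialised at each training step.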