
I have an MNIST-like dataset that does not fit in memory (process memory, not GPU memory). My dataset is 4 GB.

This is not a TFLearn issue.

As far as I know, model.fit requires arrays for x and y.

TFLearn example:

model.fit(x, y, n_epoch=10, validation_set=(val_x, val_y))

I was wondering if there's a way to pass a "batch iterator" instead of an array. Basically, for each batch I would load the necessary data from disk.

This way I would not run into process memory overflow errors.

EDIT: np.memmap could be an option, but I don't see how to skip the first few bytes that make up the header.
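For reference, np.memmap accepts an offset argument given in bytes, so the header can be skipped without reading the whole file. A minimal sketch, assuming the standard MNIST IDX layout (16-byte header in the image file, 8-byte header in the label file); the file names and record count are placeholders:

import numpy as np

# Memory-map the raw IDX files, skipping their headers via the offset argument.
# Assumes the standard MNIST layout: 16-byte image header, 8-byte label header.
n_images = 60000  # placeholder record count
images = np.memmap('train-images-idx3-ubyte', dtype=np.uint8, mode='r',
                   offset=16, shape=(n_images, 28, 28))
labels = np.memmap('train-labels-idx1-ubyte', dtype=np.uint8, mode='r',
                   offset=8, shape=(n_images,))

# Slicing reads only the requested rows from disk, so one batch at a time
# can be materialised without loading the full 4 GB into process memory.
batch_x = images[0:128].astype(np.float32) / 255.0
batch_y = labels[0:128]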

  • You probably need to use a queue where you set the batch size and capacity appropriately. `tf.train.shuffle_batch` should work here. – brown.2179 Oct 09 '17 at 17:17
  • Use the offset argument of numpy.memmap, which takes the number of bytes to skip from the beginning of the file. numpy.float32 == 4 bytes, float64 == 8 bytes, etc. – JYun Apr 10 '18 at 01:37

1 Answer


You can use the Dataset API.

"The Dataset API supports a variety of file formats so that you can process large datasets that do not fit in memory"

Basically the input pipeline would become part of your graph.
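For an MNIST-like binary file, a minimal sketch in the TF 1.x style of that era could use tf.data.FixedLengthRecordDataset, whose header_bytes argument also addresses the header-skipping problem from the question. The file names and the 16/8-byte headers are assumptions based on the standard MNIST IDX layout:

import tensorflow as tf

# Each MNIST image record is 28*28 = 784 bytes; header_bytes skips the
# IDX header (assumption: standard MNIST IDX files).
image_bytes = 28 * 28
images_ds = tf.data.FixedLengthRecordDataset(
    'train-images-idx3-ubyte', record_bytes=image_bytes, header_bytes=16)
labels_ds = tf.data.FixedLengthRecordDataset(
    'train-labels-idx1-ubyte', record_bytes=1, header_bytes=8)

def decode(image_raw, label_raw):
    # Convert the raw byte strings into tensors.
    image = tf.cast(tf.decode_raw(image_raw, tf.uint8), tf.float32) / 255.0
    image = tf.reshape(image, [28, 28])
    label = tf.cast(tf.decode_raw(label_raw, tf.uint8), tf.int64)[0]
    return image, label

# Reading, decoding, shuffling and batching are all graph ops, so only one
# batch at a time is pulled from disk.
dataset = (tf.data.Dataset.zip((images_ds, labels_ds))
           .map(decode)
           .shuffle(buffer_size=10000)
           .batch(128)
           .repeat())

iterator = dataset.make_one_shot_iterator()
next_images, next_labels = iterator.get_next()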

If memory is still an issue, you can use a generator to create your tf.data.Dataset. You could also speed up the pipeline by preparing TFRecords and building your Dataset from those.
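A sketch of the generator route, again in TF 1.x style; the generator below is a hypothetical loader that reads one slice from disk per batch, here via the memory-mapped IDX files from the question:

import numpy as np
import tensorflow as tf

def batch_generator(batch_size=128):
    # Hypothetical loader: yields (images, labels) slices read lazily from
    # disk through np.memmap, as shown in the question.
    images = np.memmap('train-images-idx3-ubyte', dtype=np.uint8, mode='r',
                       offset=16, shape=(60000, 28, 28))
    labels = np.memmap('train-labels-idx1-ubyte', dtype=np.uint8, mode='r',
                       offset=8, shape=(60000,))
    for start in range(0, len(labels), batch_size):
        x = images[start:start + batch_size].astype(np.float32) / 255.0
        y = labels[start:start + batch_size].astype(np.int64)
        yield x, y

dataset = tf.data.Dataset.from_generator(
    batch_generator,
    output_types=(tf.float32, tf.int64),
    output_shapes=(tf.TensorShape([None, 28, 28]), tf.TensorShape([None])))

iterator = dataset.make_one_shot_iterator()
next_x, next_y = iterator.get_next()

Either way, the full dataset never has to be resident in memory; only the current batch is materialised at each training step.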