
I am trying to feed minibatches of numpy arrays to my model, but I am stuck on batching. Using `tf.train.shuffle_batch` raises an error because the `images` array is larger than 2 GB. I tried to work around it by creating placeholders, but when I try to feed the arrays they are still represented by `tf.Tensor` objects. My main concern is that I define the operations inside the model class, and the objects are not called before running the session. Does anyone have an idea how to handle this issue?

def main(mode, steps):
    config = Configuration(mode, steps)

    if config.TRAIN_MODE:
        images, labels = read_data(config.simID)

        assert images.shape[0] == labels.shape[0]

        images_placeholder = tf.placeholder(images.dtype, images.shape)
        labels_placeholder = tf.placeholder(labels.dtype, labels.shape)

        dataset = tf.data.Dataset.from_tensor_slices(
            (images_placeholder, labels_placeholder))

        # shuffle
        dataset = dataset.shuffle(buffer_size=1000)

        # batch
        dataset = dataset.batch(batch_size=config.batch_size)

        iterator = dataset.make_initializable_iterator()

        image, label = iterator.get_next()

        model = Model(config, image, label)

        with tf.Session() as sess:

            sess.run(tf.global_variables_initializer())

            sess.run(iterator.initializer,
                     feed_dict={images_placeholder: images,
                                labels_placeholder: labels})

            # ...

            for step in xrange(steps):

                sess.run(model.optimize)
  • It looks like you are not initializing the dataset iterator you create. See [here](https://www.tensorflow.org/programmers_guide/datasets#consuming_numpy_arrays) for more information. – mikkola Mar 01 '18 at 15:55
  • I added the initialization, still when I evaluate the "sess.run(model.optimize..." line I get the error: "TypeError: The value of a feed cannot be a tf.Tensor object. Acceptable feed values include Python scalars, strings, lists, numpy ndarrays, or TensorHandles." – Dávid Papp Mar 01 '18 at 16:13
  • If you use the iterator to feed data to your model, do not use the `feed_dict` input to `sess.run`. Instead, specify your model in terms of the outputs of `get_next()`: use `image_batch`, and `label_batch` in place of `image` and `label` in your model, then just call `sess.run(model.optimize)`. The iterator will feed data to your model in the background, feeding via `feed_dict` is not necessary. – mikkola Mar 01 '18 at 16:20
  • I still get an error "You must feed a value for placeholder tensor 'Placeholder' with dtype float and shape [batch,height,width,depth]" After batching the dataset it returns [?,height,width,depth] with the question mark "?" in place of the batch size. – Dávid Papp Mar 01 '18 at 16:29
  • Did you remove the placeholders from your model, and instead use `image_batch` and `label_batch` in their place? – mikkola Mar 01 '18 at 16:30

1 Answer


You are using the initializable iterator of `tf.data` to feed data to your model. This lets you parametrize the dataset with placeholders and then call the iterator's initializer op to prepare it for use.

If you use the initializable iterator, or any other `tf.data` iterator, to feed inputs to your model, you should not use the `feed_dict` argument of `sess.run` for data feeding. Instead, define your model in terms of the outputs of `iterator.get_next()` and omit `feed_dict` from `sess.run`.

Something along these lines:

iterator = dataset.make_initializable_iterator()
image_batch, label_batch = iterator.get_next()

# use get_next outputs to define model
model = Model(config, image_batch, label_batch) 

# placeholders are fed only while initializing the iterator
sess.run(iterator.initializer,
         feed_dict={images_placeholder: images,
                    labels_placeholder: labels})

for step in xrange(steps):
    # the iterator feeds image and label batches in the background
    sess.run(model.optimize)

The iterator will feed data to your model in the background, additional feeding via feed_dict is not necessary.
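As an aside, the shuffle-then-batch semantics the iterator performs are easy to reason about outside TensorFlow. Here is a plain-numpy sketch of what happens behind the scenes (the `shuffle_and_batch` helper is hypothetical, for illustration only, and not part of the tf.data API):

```python
import numpy as np

def shuffle_and_batch(images, labels, batch_size, seed=None):
    """Yield shuffled (image_batch, label_batch) pairs, mimicking
    dataset.shuffle(...).batch(...). The final batch may be smaller
    when the number of samples does not divide evenly."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(images))  # one shuffled pass over the data
    for start in range(0, len(images), batch_size):
        idx = order[start:start + batch_size]
        yield images[idx], labels[idx]

# tiny demo: 10 samples, batch size 4 -> batch sizes 4, 4, 2
images = np.arange(10 * 2, dtype=np.float32).reshape(10, 2)
labels = np.arange(10)
sizes = [b.shape[0] for b, _ in shuffle_and_batch(images, labels, 4, seed=0)]
print(sizes)  # [4, 4, 2]
```

Note that `dataset.shuffle(buffer_size=...)` only shuffles within a sliding buffer rather than over the whole dataset at once, so the sketch above matches it exactly only when the buffer covers all samples.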

mikkola
  • Seems like I am getting there, although, the dataset.batch command still returns "batch size=?". Do you have any idea why this might happen? Or is it normal? – Dávid Papp Mar 01 '18 at 18:15
  • @DávidPapp In `dataset.batch(batch_size)`, `batch_size` should be a `tf.int64` scalar tf.Tensor, or convertible into such (e.g., a regular integer). Make sure that your `config.batch_size` fulfills either of these requirements. I don't remember seeing `dataset.batch` output `?`, so I can't say for sure if it is normal... -- Are you sure that is the operation that is emitting the output? – mikkola Mar 01 '18 at 18:19
  • The problem with the batch size apparently came from some intended behavior of the batch function, as described [here](https://github.com/tensorflow/tensorflow/issues/13161) – Dávid Papp Mar 02 '18 at 09:15
  • @DávidPapp good find. That happens because the dataset does not exactly split into full batches (last batch has fewer samples than others). I am glad your issue was solved! – mikkola Mar 02 '18 at 09:24
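For completeness: the `?` in the static shape appears precisely because tf.data cannot guarantee at graph-construction time that the last batch is full. A plain-numpy sketch of the two possible behaviors (the `batch_sizes` helper is hypothetical, not a tf.data API; in later TensorFlow versions `dataset.batch(..., drop_remainder=True)` gives the second behavior and a fully static batch dimension):

```python
def batch_sizes(n_samples, batch_size, drop_remainder=False):
    """Return the per-batch sizes a dataset of n_samples would produce.
    With drop_remainder=True the trailing partial batch is discarded,
    so every batch has exactly batch_size elements."""
    full, rem = divmod(n_samples, batch_size)
    sizes = [batch_size] * full
    if rem and not drop_remainder:
        sizes.append(rem)  # the partial batch that forces the '?' shape
    return sizes

print(batch_sizes(10, 4))                       # [4, 4, 2] -> shape [?, ...]
print(batch_sizes(10, 4, drop_remainder=True))  # [4, 4]    -> shape [4, ...]
```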