
Suppose I have a huge list of objects, each of which could be, for example, a list of NumPy arrays.

What’s the best way to pass this dataset to tensorflow?

I want to be able to randomly shuffle the data and form batches. Maybe it's worth shuffling the dataset and forming batches using standard Python (NumPy) procedures, and then using something like tf.data.Dataset.from_generator()?

The straightforward approach of converting the full dataset to a tf.Tensor seems to be a dead end due to the size limit on the tf.GraphDef protocol buffer (2 GB, according to the TensorFlow documentation).
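
For concreteness, the NumPy-side shuffling and batching I have in mind would be something like this sketch (the function name and batch size are illustrative):

import numpy as np

def batch_generator(data, batch_size=32):
  # Shuffle indices with NumPy, then yield one batch of items at a time.
  idx = np.random.permutation(len(data))
  for start in range(0, len(data), batch_size):
    yield [data[i] for i in idx[start:start + batch_size]]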

user1786577

1 Answer


It looks like your data is large, but still small enough to fit in memory? If so, you are on the right track with tf.data.Dataset.from_generator(). You could then shuffle and batch with something like:

import tensorflow as tf

# your data
data = range(1024)

def gen():
  for item in data:
    yield item  # yield each item, not the whole list

ds = tf.data.Dataset.from_generator(
    gen, tf.int64, tf.TensorShape([])).shuffle(buffer_size=128).batch(batch_size=4)
value = ds.make_one_shot_iterator().get_next()

sess = tf.Session()
sess.run(value)  # one batch of 4 elements; contents vary because of the shuffle

Alternatively, you could dump your data to a TFRecord file and read it back with a TFRecordDataset. This test should help you get started.
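
For the TFRecord route, a minimal sketch might look like the following, staying with the TF 1.x API used above (the filename data.tfrecords, the feature key 'values', and the fixed array length 3 are all illustrative assumptions):

import numpy as np
import tensorflow as tf

# Write: serialize each array as a tf.train.Example record.
with tf.python_io.TFRecordWriter('data.tfrecords') as writer:
  for arr in [np.random.rand(3) for _ in range(1024)]:  # stand-in for your data
    example = tf.train.Example(features=tf.train.Features(feature={
        'values': tf.train.Feature(
            float_list=tf.train.FloatList(value=arr.tolist()))}))
    writer.write(example.SerializeToString())

# Read: parse each record back into a tensor, then shuffle and batch as before.
def parse(record):
  return tf.parse_single_example(
      record, {'values': tf.FixedLenFeature([3], tf.float32)})['values']

ds = (tf.data.TFRecordDataset('data.tfrecords')
      .map(parse)
      .shuffle(buffer_size=128)
      .batch(batch_size=4))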

Saurabh Saxena