
Suppose I have a huge list of objects, each of which could be, for example, a list of NumPy arrays.

What’s the best way to pass this dataset to tensorflow?

I want to be able to randomly shuffle the data and form batches. Maybe it's worth shuffling the dataset and forming batches using standard Python (NumPy) procedures, and then using something like tf.data.Dataset.from_generator()?

The straightforward approach of converting the full dataset to a tf.Tensor seems to be a dead end due to the size limit on the tf.GraphDef protocol buffer (2 GB, according to the TensorFlow documentation).
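
For concreteness, the NumPy-side shuffling and batching I have in mind would be something like this sketch (the function name and batch size are illustrative):

import numpy as np

def batch_generator(data, batch_size=32):
  # Shuffle indices with NumPy, then yield one batch of items at a time.
  idx = np.random.permutation(len(data))
  for start in range(0, len(data), batch_size):
    yield [data[i] for i in idx[start:start + batch_size]]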

user1786577

1 Answer


It looks like your data is large, but still small enough to fit in memory? If so, you are on the right track with tf.data.Dataset.from_generator(). You could then shuffle and batch with something like:

import tensorflow as tf

# your data
data = range(1024)

def gen():
  for item in data:
    yield item  # yield each item, not the whole list

ds = tf.data.Dataset.from_generator(
    gen, tf.int64, tf.TensorShape([])).shuffle(buffer_size=128).batch(batch_size=4)
value = ds.make_one_shot_iterator().get_next()

sess = tf.Session()
sess.run(value)  # one batch of 4 elements; contents vary because of the shuffle

Alternatively, you could dump your data to a TFRecord file and read it back with a TFRecordDataset. This test should help you get started.
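
For the TFRecord route, a minimal sketch might look like the following, staying with the TF 1.x API used above (the filename data.tfrecords, the feature key 'values', and the fixed array length 3 are all illustrative assumptions):

import numpy as np
import tensorflow as tf

# Write: serialize each array as a tf.train.Example record.
with tf.python_io.TFRecordWriter('data.tfrecords') as writer:
  for arr in [np.random.rand(3) for _ in range(1024)]:  # stand-in for your data
    example = tf.train.Example(features=tf.train.Features(feature={
        'values': tf.train.Feature(
            float_list=tf.train.FloatList(value=arr.tolist()))}))
    writer.write(example.SerializeToString())

# Read: parse each record back into a tensor, then shuffle and batch as before.
def parse(record):
  return tf.parse_single_example(
      record, {'values': tf.FixedLenFeature([3], tf.float32)})['values']

ds = (tf.data.TFRecordDataset('data.tfrecords')
      .map(parse)
      .shuffle(buffer_size=128)
      .batch(batch_size=4))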

Saurabh Saxena