
TensorFlow queues offered the advantage that data could be fetched and queued independently of the rest of the graph, allowing the CPU/disk to pre-fetch data so that the GPUs don't run dry.

I've read in a blog post that this is missing again with the Dataset API. However, Dataset.shuffle() takes a buffer_size argument, which I would assume gives a buffer queue? Is this the same as combining the Dataset API with a queue (see code below)? Is there a recommended way to create a proper, independent data-fetching queue?


Code example for Dataset API + Queue:

import tensorflow as tf

sample_set = tf.data.Dataset.from_generator(...)
sample = sample_set.make_one_shot_iterator().get_next()
# shuffle_batch builds a RandomShuffleQueue and fills it with num_threads
# enqueue threads, dequeuing batches of size 10 as they become available.
sample_batch = tf.train.shuffle_batch([sample], batch_size=10,
                                       capacity=30, num_threads=1,
                                       min_after_dequeue=1)

... is this the same as the following, pure Dataset API version? (And how can I define the number of threads there?)

sample_set = tf.data.Dataset.from_generator(...)
# Shuffling and batching happen inside the Dataset pipeline itself.
sample_set = sample_set.shuffle(buffer_size=30)
sample_set = sample_set.batch(10)
sample = sample_set.make_one_shot_iterator().get_next()
Honeybear
  • Check here, it should hopefully answer it: https://stackoverflow.com/questions/46444018/meaning-of-buffer-size-in-dataset-map-dataset-prefetch-and-dataset-shuffle Short answer is yes, with the Dataset API you still get preloading in the background while processing on the GPU – Burton2000 Mar 20 '18 at 16:29
  • Thanks, the "duplicate" and your link provide enough insights to give me all I need. `Dataset.prefetch()` and the `num_parallel_calls` of `Dataset.map()` basically offer everything for multi-threaded prefetching, making Queues obsolete (and then obviously combining Dataset API + Queues is a bad idea) – Honeybear Mar 20 '18 at 17:20

0 Answers