Is Tensorflow Dataset API slower than Queues?

Question

I replaced CIFAR-10 preprocessing pipeline in the project with Dataset API approach and it resulted in performance decrease of about 10-20%.

Preporcessing is rather standart: - read image from disk - make random/crop and flip - shuffle, batch - feed to the model

Overall i see that batche processing is now 15% faster, but every once in a while (or, more precisely, whenever I reinitialize dataframe or expect reshuffling) the batch is being blocked for up long time (30 sec) which totals to slower epoch-per-epoch processing.

This behaviour seems to do something with internal hashing. If I reduce N in ds.shuffle(buffer_size=N) delays are shorter but proportionally more frequent. Removing shuffle at all results to delays as if buffer_size was set to dataset size.

Can somebody explain internal logic of Dataset API when it comes to reading/caching? Is there any reason at all to expect Dataset API to work faster than manually created Queues?

I am using TF 1.3.

score 10 · Answer 1 · edited Jun 20 '20 at 09:12

If you implement the same pipeline using the tf.data.Dataset API and using queues, the performance of the Dataset version should be better than the queue-based version.

However, there are a few performance best practices to observe in order to get the best performance. We have collected these in a performance guide for tf.data. Here are the main issues:

Prefetching is important: the queue-based pipelines prefetch by default and the Dataset pipelines do not. Adding dataset.prefetch(1) to the end of your pipeline will give you most of the benefit of prefetching, but you might need to tune this further.
The shuffle operator has a delay at the beginning, while it fills its buffer. The queue-based pipelines shuffle a concatenation of all epochs, which means that the buffer is only filled once. In a Dataset pipeline, this would be equivalent to dataset.repeat(NUM_EPOCHS).shuffle(N). By contrast, you can also write dataset.shuffle(N).repeat(NUM_EPOCHS), but this needs to restart the shuffling in each epoch. The latter approach is slightly preferable (and truer to the definition of SGD, for example), but the difference might not be noticeable if your dataset is large.

We are adding a fused version of shuffle-and-repeat that doesn't incur the delay, and a nightly build of TensorFlow will include the custom tf.contrib.data.shuffle_and_repeat() transformation that is equivalent to dataset.shuffle(N).repeat(NUM_EPOCHS) but doesn't suffer the delay at the start of each epoch.

Having said this, if you have a pipeline that is significantly slower when using tf.data than the queues, please file a GitHub issue with the details, and we'll take a look!

score 0 · Answer 2 · answered May 27 '19 at 07:10

Suggested things didn't solve my problem back in the days, but I would like to add a couple of recommendations for those, who don't want to learn about queues and still get the most out of TF data pipeline:

Convert your input data into TFRecord (as cumbersome as it might be)
Use recommended input pipeline format

.

files = tf.data.Dataset.list_files(data_dir)
ds = tf.data.TFRecordDataset(files, num_parallel_reads=32)
ds = (ds.shuffle(10000)
    .repeat(EPOCHS)
    .map(parser_fn, num_parallel_calls=64)
    .batch(batch_size)
)
dataset = dataset.prefetch(2)

Where you have to pay attention to 3 main components:

num_parallel_read=32 to parallelize disk IO operations
num_parallel_calls=64 to parallelize calls to parser function
prefetch(2)

Is Tensorflow Dataset API slower than Queues?

2 Answers2