I need help optimizing a custom TensorFlow model. I have a 40 GB ZLIB-compressed TFRecords file containing my training data. Each sample consists of two 384x512x3 images and a 384x512x2 vector field. I am loading my data as follows:
import tensorflow as tf
import tensorflow.contrib.slim as slim

num_threads = 16

# Read the ZLIB-compressed TFRecords with multiple parallel readers.
reader_kwargs = {
    'options': tf.python_io.TFRecordOptions(
        tf.python_io.TFRecordCompressionType.ZLIB)
}
data_provider = slim.dataset_data_provider.DatasetDataProvider(
    dataset,
    num_readers=num_threads,
    reader_kwargs=reader_kwargs)
image_a, image_b, flow = data_provider.get(['image_a', 'image_b', 'flow'])

# Batch individual samples for training.
image_as, image_bs, flows = tf.train.batch(
    [image_a, image_b, flow],
    batch_size=dataset_config['BATCH_SIZE'],  # 8
    capacity=dataset_config['BATCH_SIZE'] * 10,
    num_threads=num_threads,
    allow_smaller_final_batch=False)
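For scale, assuming the decoded tensors are float32, each sample is roughly 6 MB, so a batch of 8 is about 48 MB and the capacity-80 batch queue holds under half a gigabyte (back-of-the-envelope numbers, not measured):

# Rough per-sample memory footprint after decoding, assuming float32 tensors.
bytes_per_float = 4
image_bytes = 384 * 512 * 3 * bytes_per_float   # one 384x512x3 image
flow_bytes = 384 * 512 * 2 * bytes_per_float    # the 384x512x2 vector field
sample_bytes = 2 * image_bytes + flow_bytes
print(sample_bytes / 2**20)        # ~6 MB per sample
print(sample_bytes * 8 / 2**20)    # ~48 MB per batch of 8
print(sample_bytes * 80 / 2**20)   # ~480 MB at queue capacity of 80 samples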
However, I am only getting about 0.25 to 0.30 global steps / second. (SLOW!)
Here is my TensorBoard dashboard for the parallel reader. It is at 99%-100% consistently.
I plotted my GPU usage over time (% utilization, sampled once per second). It looks data-starved, but I'm not sure how to fix this. I've tried increasing and decreasing the number of threads, but it doesn't seem to make a difference. I am training on an NVIDIA K80 GPU with 4 CPUs and 61 GB of RAM.
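(For context, the GPU usage curve comes from polling nvidia-smi roughly once per second, along these lines; the interval and output handling are illustrative, not the exact script used for the plot:)

import subprocess
import time

# Minimal GPU-utilization polling sketch (illustrative only).
while True:
    util = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=utilization.gpu',
         '--format=csv,noheader,nounits'])
    print(util.decode().strip())  # e.g. "37" -> 37% GPU utilization
    time.sleep(1)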
How can I make this train faster?