
I need help optimizing a custom TensorFlow model. I have a 40 GB ZLIB-compressed .TFRecords file containing my training data. Each sample consists of two 384x512x3 images and a 384x512x2 vector field. I am loading my data as follows:

    num_threads = 16
    reader_kwargs = {'options': tf.python_io.TFRecordOptions(tf.python_io.TFRecordCompressionType.ZLIB)}
    data_provider = slim.dataset_data_provider.DatasetDataProvider(
                        dataset,
                        num_readers=num_threads,
                        reader_kwargs=reader_kwargs)
    image_a, image_b, flow = data_provider.get(['image_a', 'image_b', 'flow'])

    image_as, image_bs, flows = tf.train.batch(
        [image_a, image_b, flow],
        batch_size=dataset_config['BATCH_SIZE'], # 8
        capacity=dataset_config['BATCH_SIZE'] * 10,
        num_threads=num_threads,
        allow_smaller_final_batch=False)

However, I am only getting about 0.25 to 0.30 global steps / second. (SLOW!)

Here is my TensorBoard dashboard for the parallel reader. Its queue is at 99%-100% full consistently.

I plotted my GPU usage over time (% per sec). It looks data-starved, but I'm not sure how to fix this. I've tried increasing/decreasing the number of threads, but it doesn't seem to make a difference. I am training on an NVIDIA K80 GPU with 4 CPUs and 61 GB of RAM.

[GPU usage plot]

How can I make this train faster?

Sam P

1 Answer


If your examples are small, then using DatasetDataProvider will not lead to satisfying results: it only reads one example at a time, which can be a bottleneck. I already added a feature request on GitHub.

In the meantime, you'll have to roll your own input queue that uses read_up_to:

    batch_size = 10000
    num_tfrecords_at_once = 1024
    num_threads = 4

    # Queue of input file paths; your_filenames is a placeholder for your own list.
    # The file in the question is ZLIB-compressed, so pass matching options to the reader.
    filename_queue = tf.train.string_input_producer(your_filenames)
    reader = tf.TFRecordReader(
        options=tf.python_io.TFRecordOptions(
            tf.python_io.TFRecordCompressionType.ZLIB))

    # Here's where the magic happens: grab many serialized records per read.
    _, records = reader.read_up_to(filename_queue, num_tfrecords_at_once)

    # Batch the still-serialized records with 'enqueue_many=True'
    batch_serialized_example = tf.train.shuffle_batch(
        [records],
        num_threads=num_threads,
        batch_size=batch_size,
        capacity=10 * batch_size,
        min_after_dequeue=2 * batch_size,
        enqueue_many=True)

    # Parse the whole batch at once.
    parsed = tf.parse_example(
        batch_serialized_example,
        features=whatever_features_you_have)
    # Use parsed['feature_name'] etc. below
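
For the data described in the question, whatever_features_you_have might look something like the sketch below. This assumes each serialized example stores the raw float32 bytes of the two images and the flow field under the keys 'image_a', 'image_b', and 'flow'; if they were written differently (e.g. as float lists or encoded PNGs), the feature specs and decoding would need to change.

    # Illustrative only: assumes raw float32 bytes stored under these keys.
    features = {
        'image_a': tf.FixedLenFeature([], tf.string),
        'image_b': tf.FixedLenFeature([], tf.string),
        'flow': tf.FixedLenFeature([], tf.string),
    }
    parsed = tf.parse_example(batch_serialized_example, features=features)

    # Decode the raw bytes and restore the per-sample shapes from the question.
    image_as = tf.reshape(tf.decode_raw(parsed['image_a'], tf.float32),
                          [-1, 384, 512, 3])
    image_bs = tf.reshape(tf.decode_raw(parsed['image_b'], tf.float32),
                          [-1, 384, 512, 3])
    flows = tf.reshape(tf.decode_raw(parsed['flow'], tf.float32),
                       [-1, 384, 512, 2])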
panmari
  • Thanks for the suggestion! I went ahead and tried it and no difference. Each TFRecord is fairly large (two 384x512x3 float32 and one 384x512x2 float32), so I don't think I'm having the same issue you have. – Sam P Jul 03 '17 at 14:56
  • Right, with records of this size it might not make a difference. Are you doing any preprocessing before batching? It might make sense to pin all these operations to the CPU to prevent the automatic placer from putting some ops on other devices, which might cause unnecessary copying (see the sketch after these comments). – panmari Jul 03 '17 at 18:07
  • I'm doing preprocessing *after* batching, explicitly on the CPU. – Sam P Jul 03 '17 at 18:13
  • Then I guess that must be your bottleneck. – panmari Jul 03 '17 at 19:42
  • Thanks! That resulted in a 25% speedup. I've fiddled with the network since originally posting, so it sped up from 0.95 steps/sec to 1.20 steps/sec. Still pretty slow, but that could just be due to the preprocessing + network size. – Sam P Jul 03 '17 at 20:12
  • ^ To clarify, I moved the preprocessing to before batching instead of after. – Sam P Jul 03 '17 at 20:19
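
For reference, here is a minimal sketch of the arrangement the comments converge on: decoding and preprocessing pinned explicitly to the CPU and run before batching, so the batch queue hands the training step ready-to-use tensors. preprocess_fn is a hypothetical stand-in for whatever preprocessing the pipeline actually performs.

    num_threads = 16
    with tf.device('/cpu:0'):
        # Read and decode one example at a time on the CPU.
        image_a, image_b, flow = data_provider.get(['image_a', 'image_b', 'flow'])
        # Hypothetical per-example preprocessing, applied before batching.
        image_a, image_b, flow = preprocess_fn(image_a, image_b, flow)
        # Batching also stays on the CPU; the training step dequeues from here.
        image_as, image_bs, flows = tf.train.batch(
            [image_a, image_b, flow],
            batch_size=dataset_config['BATCH_SIZE'],  # 8
            capacity=dataset_config['BATCH_SIZE'] * 10,
            num_threads=num_threads,
            allow_smaller_final_batch=False)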