7

I was interested in testing my neural net (an autoencoder that serves as the generator plus a CNN as the discriminator), which uses 3D conv/deconv layers, on the new Volta architecture to benefit from mixed-precision training. I compiled the most recent TensorFlow 1.4 source with CUDA 9 and cuDNN 7.0 and cast all the trainable variables used by my conv/deconv layers to tf.float16. All my input and output tensors also have sizes that are multiples of 8.
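
For concreteness, the casting looks roughly like the sketch below; it follows the float32-master-weights pattern from NVIDIA's mixed-precision guide, and the scope name and layer shapes are only illustrative:

    import tensorflow as tf

    def float32_variable_storage_getter(getter, name, shape=None, dtype=None,
                                        initializer=None, trainable=True,
                                        *args, **kwargs):
        # Store trainable variables in float32 and hand back a float16 cast,
        # per the pattern in NVIDIA's mixed-precision guide.
        storage_dtype = tf.float32 if trainable else dtype
        variable = getter(name, shape, dtype=storage_dtype,
                          initializer=initializer, trainable=trainable,
                          *args, **kwargs)
        if trainable and dtype != tf.float32:
            variable = tf.cast(variable, dtype)
        return variable

    with tf.variable_scope('gen', custom_getter=float32_variable_storage_getter):
        # Illustrative 3D conv layer; all channel counts are multiples of 8.
        x = tf.placeholder(tf.float16, [None, 16, 16, 16, 8])       # NDHWC
        w = tf.get_variable('w', [3, 3, 3, 8, 16], dtype=tf.float16)
        y = tf.nn.conv3d(x, w, strides=[1, 1, 1, 1, 1], padding='SAME')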

Unfortunately, I do not see any substantial speed improvement with this configuration; the training time is roughly the same as with tf.float32. My understanding is that with the Volta architecture and cuDNN 7.0, mixed precision should be detected automatically by TF, enabling the use of Tensor Core math. Am I wrong, or is there anything I need to do to enable it? I also tried the TF 1.5 nightly build, and it seems to be even slower than my custom 1.4 build.

I would appreciate it if any developer involved in TensorFlow could answer this.

EDIT: After talking with NVIDIA tech support, it seems that, while TF supports float16, it currently integrates mixed-precision acceleration for plain 2D conv ops but not for 3D conv ops.
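
A rough micro-benchmark sketch (sizes are arbitrary) that should reproduce this: it times 2D and 3D convolutions in both precisions, and per the above, only the 2D case is expected to speed up in fp16 on a V100:

    import time
    import tensorflow as tf

    def time_op(op, iters=50):
        # Mean wall time per sess.run after one warm-up run
        # (the warm-up also lets cuDNN autotune pick an algorithm).
        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer())
            sess.run(op)
            start = time.time()
            for _ in range(iters):
                sess.run(op)
            return (time.time() - start) / iters

    for dtype in (tf.float32, tf.float16):
        x2 = tf.random_normal([32, 128, 128, 64], dtype=dtype)     # NHWC
        w2 = tf.Variable(tf.truncated_normal([3, 3, 64, 64],
                                             dtype=dtype, stddev=0.1))
        conv2 = tf.nn.conv2d(x2, w2, [1, 1, 1, 1], 'SAME')

        x3 = tf.random_normal([8, 16, 64, 64, 64], dtype=dtype)    # NDHWC
        w3 = tf.Variable(tf.truncated_normal([3, 3, 3, 64, 64],
                                             dtype=dtype, stddev=0.1))
        conv3 = tf.nn.conv3d(x3, w3, [1, 1, 1, 1, 1], 'SAME')

        print(dtype.name,
              'conv2d %.4fs' % time_op(conv2.op),
              'conv3d %.4fs' % time_op(conv3.op))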

  • @Engineero AWS just released P3 instances with the V100 – yjmade Nov 13 '17 at 05:02
  • I have a V100 too, and also feel frustrated by the lack of support in TensorFlow. The lack of support for convolution groups is also annoying: https://github.com/tensorflow/tensorflow/issues/3332 . This post gives me hope that something will be published soon: https://github.com/tensorflow/tensorflow/issues/12474#issuecomment-338309705 . If I become impatient, I'll try Caffe2; it has had support for the newest cuDNN features for a long time. – Rémi Nov 13 '17 at 21:16
  • Did you try the steps proposed by NVIDIA here: http://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#tensorflow ? As TF 1.4 is now available, we should just need to change the code to support faster training. Also look at this blog post: https://devblogs.nvidia.com/parallelforall/mixed-precision-training-deep-neural-networks/ – melgor89 Nov 15 '17 at 10:04
  • @melgor89 I did, and I also used the TF container built by NVIDIA. It seems that while mixed precision is supported for plain matmul and 2D convolution ops, it is not enabled for 3D conv ops yet. – Julien Jorda Nov 16 '17 at 18:24

3 Answers

3

Based on the NVIDIA documentation I ran a benchmark with FP16 (Tensor Cores). For that I modified the alexnet_benchmark script shipped with TensorFlow: https://gist.github.com/melgor/946b9643aa25dd3839a86804fc580741
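
The gist has the full script; the essence of the modification is just parameterizing the dtype used for the benchmark's inputs and variables, roughly like this (illustrative, not copied verbatim from the gist):

    import tensorflow as tf

    DTYPE = tf.float16  # flip to tf.float32 for the baseline run

    def first_conv_layer(batch_size=512, image_size=224):
        # Build the benchmark's synthetic input and first conv in the
        # requested precision, in the style of alexnet_benchmark.py.
        images = tf.Variable(tf.random_normal(
            [batch_size, image_size, image_size, 3],
            dtype=DTYPE, stddev=0.1))
        kernel = tf.Variable(tf.truncated_normal(
            [11, 11, 3, 64], dtype=DTYPE, stddev=0.1), name='conv1_w')
        return tf.nn.conv2d(images, kernel, [1, 4, 4, 1], padding='SAME')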

Overall, AlexNet is only ~35% faster (0.064 s vs 0.099 s per batch, i.e. roughly 1.55x throughput), not so much; I was hoping to get ~2x. Maybe ResNet would show a bigger difference. The nice thing is that I can fit the model with batch_size = 5120 (FP32 cannot); one forward-backward pass takes 0.653 s, so training ImageNet for 90 epochs takes ~4 h (about 250 steps per epoch at that batch size, and 90 x 250 x 0.653 s is roughly 4.1 h).

batch_size=512
alexnet_fp32: Forward-backward across 100 steps, 0.099 +/- 0.000 sec / batch
alexnet_fp16: Forward-backward across 100 steps, 0.064 +/- 0.000 sec / batch

Edit:

I managed to run the ResNet models in FP16, but without BatchNorm; for some reason BN does not work with fp16 (a possible workaround is sketched after the numbers):

batch_size=256
resnet50_fp32: Forward-backward across 100 steps, 0.575 +/- 0.001 sec / batch
resnet50_fp16: Forward-backward across 100 steps, 0.504 +/- 0.001 sec / batch

batch_size=128
resnet152_fp32: Forward-backward across 100 steps, 0.757 +/- 0.001 sec / batch
resnet152_fp16: Forward-backward across 100 steps, 0.581 +/- 0.010 sec / batch

The gain for ResNet is even smaller. It looks like FP16 does not give much of a gain on the V100; I am not sure why. Maybe Tensor Core support is not fully integrated yet.
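
Regarding the BatchNorm failure: a possible workaround (untested here, so treat it as a sketch) is to run only the normalization in float32 and cast back, since fp16 batch statistics can underflow or overflow:

    import tensorflow as tf

    def batch_norm_fp32(x, training):
        # Hypothetical workaround: do the normalization in float32 even
        # when the surrounding layers compute in float16, then cast back.
        y = tf.layers.batch_normalization(tf.cast(x, tf.float32),
                                          training=training)
        return tf.cast(y, x.dtype)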

melgor89
  • Thanks for the pointer to the NVIDIA documentation. I had not found it. – Rémi Nov 15 '17 at 19:02
  • Yeah, I did a similar thing last week and observed more or less the same thing. I don't know if it is because TensorFlow needs more work to support Tensor Cores, or something else. – derekhh Nov 20 '17 at 21:59
  • When you say 35% faster, is it a one-to-one comparison, i.e. is your single-precision (FP32) AlexNet run on a Tesla V100 as well? – Julien Jorda Dec 06 '17 at 02:37
  • Yes, it is a comparison between AlexNet runs on the V100: one at FP32, the other at FP16 – melgor89 Dec 07 '17 at 18:10
2

I am quite interested in this topic: does anyone have an update on the current status of Volta Tensor Core integration with TensorFlow? I have run experiments to test speed with a Volta V100 GPU, TensorFlow 1.5, CUDA 9.0 and cuDNN, and came to the following conclusions:

  • Training on a Volta V100 is not faster than training on a GeForce 1080 Ti, whereas it should be materially faster. Using float16 or float32 does not change anything.
  • Training on a Volta V100 with float16 is not faster than training on a Volta V100 with float32. The Volta GPUs are supposed to be optimized for float16, so I was expecting a material speed improvement.

So basically I reached the same conclusions as the OP: Volta GPUs are not yet fully supported by TensorFlow.

This PR on the TensorFlow GitHub seems to relate to the topic, although I have not yet tested these changes: https://github.com/tensorflow/tensorflow/pull/16253
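
As a sanity check that the Tensor Core path can engage at all on a given machine, one can time a large fp16 vs fp32 matmul, which per the comments above is already supposed to be accelerated. A rough sketch:

    import time
    import tensorflow as tf

    def matmul_bench(dtype, n=8192, iters=30):
        # Time an n x n matmul in the given precision; on a V100 the fp16
        # case should be clearly faster if Tensor Cores are being used.
        g = tf.Graph()
        with g.as_default():
            a = tf.Variable(tf.random_normal([n, n], dtype=dtype))
            b = tf.Variable(tf.random_normal([n, n], dtype=dtype))
            c = tf.matmul(a, b)
            with tf.Session(graph=g) as sess:
                sess.run(tf.global_variables_initializer())
                sess.run(c.op)  # warm-up
                start = time.time()
                for _ in range(iters):
                    sess.run(c.op)
                return (time.time() - start) / iters

    print('fp32 %.4fs' % matmul_bench(tf.float32))
    print('fp16 %.4fs' % matmul_bench(tf.float16))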

Stringer Bell
  • I came to similar conclusions training a complex sequence-to-sequence model. I trained the model on a system with 2x GTX 1080 Tis with FP16 and on an AWS p3.16xlarge instance with 8x V100s using the same code (so also FP16). The p3.16xlarge instance was only 4.125x as fast, so the training performance per GTX 1080 Ti or per V100 was exactly the same. Very disappointing. I'm still trying to tweak my software to see some speed-up. – Visionscaper Aug 29 '18 at 21:40
0

I believe TensorFlow is not using the proper cuDNN API calls for determining the best algorithms. I just grepped the TensorFlow code for cudnnGetConvolutionForwardAlgorithm_v7 and cudnnFindConvolutionForwardAlgorithmEx and found no matches. I am going to raise a ticket with TensorFlow.