
I am using code (https://github.com/vuptran/cardiac-segmentation) which makes use of tensorflow.

If I use the tensorflow-gpu backend, the code works fine.

However, if I use plain (CPU-only) tensorflow, it runs without errors but behaves differently, in that it produces completely nonsensical results.

1. Any ideas as to why the GPU backend would cause the same code to produce different results than the CPU backend, and where in the code to look for evidence of this happening? [1]

2. Alternatively, is there a way to install the tensorflow-gpu backend, but somehow force it to run on the CPU? [2]


[1] This has been discussed previously in the bug tracker, but the author says the code is not GPU-specific and is not aware of any obvious reason why tensorflow should behave differently when using the CPU backend.
[2] Simply setting CUDA_VISIBLE_DEVICES='' will not work, because the code will fail at the point of importing tensorflow: the tensorflow-gpu build throws an error about not being able to find the relevant CUDA libraries.
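For reference, a minimal sketch of the usual ways to pin a TF 1.x / Keras setup (which is what this repository appears to use) to the CPU is below. It assumes the CUDA runtime libraries are at least findable, so it does not get around the import failure described in footnote [2]; the calls are standard TF 1.x / Keras API and are not taken from the cardiac-segmentation code.

```python
import os

# Hide all GPUs from TensorFlow *before* it is imported; '-1' (or '') makes
# no CUDA devices visible. With the tensorflow-gpu package the import itself
# can still fail if the CUDA libraries are missing, as footnote [2] notes.
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

import tensorflow as tf
from keras import backend as K

# Belt and braces for TF 1.x: create a session that allocates zero GPU
# devices and hand it to Keras, so every op is placed on the CPU.
config = tf.ConfigProto(device_count={'GPU': 0})
K.set_session(tf.Session(config=config))
```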
Tasos Papastylianou
  • Two questions to understand you better: 1) which script are you running? 2) do you train a model or use trained graph for inference? – Dmytro Prylipko Feb 05 '19 at 22:25
  • @DmytroPrylipko I'm running the `train_sunnybrook.py` script in that collection, which, yes, trains a model. If you run the script using tensorflow-gpu you get a good model according to the accuracy metrics. If you run it using plain tensorflow you get rubbish metrics (zeros and nans). – Tasos Papastylianou Feb 06 '19 at 12:23
  • [This post](https://stackoverflow.com/questions/48347728/how-to-find-the-origin-of-a-tensorflow-nan-error-on-a-multi-gpu-system-with-nvid) provides useful hints for debugging the source of the NaNs, for instance using tf.check_numerics (see the sketch after these comments). NaNs are often due to numeric overflows or underflows. Try reducing the learning rate and see whether you get NaNs later. Also, I can see the model uses the SGD optimizer with a fairly large learning rate; try Adam instead. – Dmytro Prylipko Feb 06 '19 at 14:18
  • @DmytroPrylipko thank you for your thoughts. Alas, changing learning rate, seed, and optimizer did not have an effect. Not sure if the cause given in that post and its links is related, but at least it's good to see that altering the GPU configuration can be a source of error more generally, which means the above is not particularly unusual behaviour in this case. – Tasos Papastylianou Feb 06 '19 at 16:31
  • Well, NaNs are not something rare in deep learning; there is even a TerminateOnNaN callback in Keras :) . I think you should systematically search for the place where the NaN happens for the first time. Good luck :) – Dmytro Prylipko Feb 06 '19 at 17:20
  • @DmytroPrylipko thanks Dmytro. It was too much hassle to deal with tbh. I've found myself a computer with an NVidia GPU instead and now the code runs as intended, without any trace of pathological numbers. I'm still in the dark as to why there's such a huge difference in behaviour though ... – Tasos Papastylianou Feb 07 '19 at 23:27
  • I finally found [this post](https://stackoverflow.com/questions/43221730/tensorflow-same-code-but-get-different-result-from-cpu-device-to-gpu-device) This is probably what I'm seeing as well, and the links there are very informative. Voting to close my question as a duplicate. – Tasos Papastylianou Feb 07 '19 at 23:34
  • Possible duplicate of [Tensorflow same code but get different result from CPU device to GPU device](https://stackoverflow.com/questions/43221730/tensorflow-same-code-but-get-different-result-from-cpu-device-to-gpu-device) – Tasos Papastylianou Feb 07 '19 at 23:35
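Picking up on the tf.check_numerics and TerminateOnNaN suggestions in the comments, here is a minimal sketch of how those hooks are typically wired up under TF 1.x / Keras; the tensor names are hypothetical and not taken from the cardiac-segmentation scripts.

```python
import tensorflow as tf
import keras

# Option 1 (Keras): abort training as soon as the loss becomes NaN, which
# narrows the problem down to a specific epoch/batch.
nan_callback = keras.callbacks.TerminateOnNaN()
# model.fit(x, y, callbacks=[nan_callback], ...)

# Option 2 (TF 1.x graph): wrap a suspect tensor so the first Inf/NaN raises
# an InvalidArgumentError carrying this message. 'suspect' is a hypothetical
# stand-in for whatever tensor you want to watch.
suspect = tf.placeholder(tf.float32, shape=[None, 10])
checked = tf.check_numerics(suspect, message='NaN/Inf in suspect tensor')

# Option 3 (TF 1.x graph): add a check op for every floating-point tensor in
# the default graph, to be run alongside the training op.
check_op = tf.add_check_numerics_ops()
# sess.run([train_op, check_op], feed_dict=...)
```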

0 Answers