
I am trying to implement a neural network architecture (Self Organizing Maps) for execution on GPUs. I am exploring TensorFlow for this task.

In TensorFlow, I noticed that you just have to specify the GPU as the device to execute something on the GPU, as in this post. It seems that the way operations are parallelized is decided by TF and the user has no way to make optimization decisions. The "Optimizing for GPU" section of the TensorFlow Performance Guide also does not talk about explicit control over parallelizing operations.

My question is, can I do CUDA-like optimization in TensorFlow? More elaborately, is it possible to define which operation will be parallelized (like defining CUDA kernels for parallel operations)?

Rohit Gavval

2 Answers


Yes, but you probably don't want to.

At the most extreme, you can define your own op (as described here: https://www.tensorflow.org/extend/adding_an_op). You can implement it as a GPU kernel and write whatever you want.
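
For the Python side, a minimal sketch, assuming you have already compiled your custom CUDA kernel into a shared library as that guide describes (the file name my_som_op.so and the op name som_update are hypothetical):

    import tensorflow as tf

    # Load the compiled custom op; the generated Python wrapper exposes the
    # registered op under a snake_case name.
    som_module = tf.load_op_library('./my_som_op.so')

    inputs = tf.random_normal([64, 128])
    updated = som_module.som_update(inputs)  # hypothetical custom op

    with tf.Session() as sess:
        sess.run(updated)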

You probably don't want to, though. The default operations are likely already well optimized; I doubt you would be able to squeeze anything significant out of them.

You can decide the device placement for each individual operation (by using tf.device), but you will incur data transfer overhead every time you switch. This should cover the cases where some operation is slow to execute on the GPU.
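
For example, a rough TF 1.x graph-mode sketch (the choice of ops here is just illustrative):

    import tensorflow as tf

    with tf.device('/gpu:0'):
        weights = tf.random_normal([1024, 256])
        inputs = tf.random_normal([64, 1024])
        activations = tf.matmul(inputs, weights)  # executed on the GPU

    with tf.device('/cpu:0'):
        # The activations are copied from GPU to CPU memory here; that copy
        # is the transfer overhead mentioned above.
        best_unit = tf.argmin(activations, axis=1)  # executed on the CPU

    with tf.Session() as sess:
        print(sess.run(best_unit))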

If you want to process part of the data on the CPU and part on the GPU, you can slice your data and run two operations (one on the CPU and one on the GPU), along the lines of the sketch below.
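
Something like this (the split sizes are arbitrary and only for illustration):

    import tensorflow as tf

    data = tf.random_normal([1000, 512])
    weights = tf.random_normal([512, 512])

    # Slice the batch in two and place each half on a different device.
    cpu_part, gpu_part = tf.split(data, [500, 500], axis=0)

    with tf.device('/cpu:0'):
        cpu_result = tf.matmul(cpu_part, weights)

    with tf.device('/gpu:0'):
        gpu_result = tf.matmul(gpu_part, weights)

    result = tf.concat([cpu_result, gpu_result], axis=0)

    with tf.Session() as sess:
        sess.run(result)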

Sorin
  • So does it mean that simply adding "with tf.device("/gpu:0")" will make TF run the code on GPU and CPU in an optimized manner? – Rohit Gavval Jan 09 '18 at 04:05
  • @RohitGavval "with tf.device("/gpu:0")" will make TF run the code on the GPU in an optimized manner. If you want to use both CPU and GPU you need to have some operations under "with tf.device('/gpu:0')" and some outside or under "with tf.device('/cpu:0')". Which operations go where it's up to you. – Sorin Jan 09 '18 at 13:38
  • When I put all operations under "with tf.device("/gpu:0")", I noticed that some of the operations were being ported to GPU and some were being performed on CPU. Doesn't it mean that "with tf.device("/gpu:0")" automatically decides which operations are best performed on GPU and which ones are best performed on CPU and assigns them accordingly? – Rohit Gavval Jan 10 '18 at 05:09
  • @RohitGavval Not all operations can be executed on the GPU. I expect you saw some of them. If you think some operation should be on the GPU please post another question and include your code. – Sorin Jan 10 '18 at 09:27

By default in TF graph mode (not eager mode), all TF ops run in parallel. There is a thread pool for that, and its size is controlled via inter_op_parallelism_threads. (See also.)
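
For example, a TF 1.x sketch (the thread counts are arbitrary):

    import tensorflow as tf

    # inter_op: size of the pool that schedules independent ops concurrently.
    # intra_op: threads a single (CPU) op may use internally.
    config = tf.ConfigProto(inter_op_parallelism_threads=4,
                            intra_op_parallelism_threads=8)

    a = tf.matmul(tf.random_normal([512, 512]), tf.random_normal([512, 512]))
    b = tf.matmul(tf.random_normal([512, 512]), tf.random_normal([512, 512]))

    with tf.Session(config=config) as sess:
        # a and b have no data dependency, so the runtime may execute them
        # in parallel, subject to the pool size above.
        sess.run([a, b])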

That does not necessarily mean that, e.g., multiple matmuls will really run in parallel if they are internally synchronized. That is the case for most CUDA ops, as there is only a single CUDA stream. See here.

Albert