UPDATE:
I found the source code of GPUDevice; it hard-codes max_streams to 1. May I know the reason?
```cpp
GPUDevice(const SessionOptions& options, const string& name,
          Bytes memory_limit, const DeviceLocality& locality,
          TfGpuId tf_gpu_id, const string& physical_device_desc,
          Allocator* gpu_allocator, Allocator* cpu_allocator)
    : BaseGPUDevice(options, name, memory_limit, locality, tf_gpu_id,
                    physical_device_desc, gpu_allocator, cpu_allocator,
                    false /* sync every op */, 1 /* max_streams */) {
  if (options.config.has_gpu_options()) {
    force_gpu_compatible_ =
        options.config.gpu_options().force_gpu_compatible();
  }
}
```
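For reference, here is a minimal sketch of what I mean by multi-threaded issuing on one GPU (TF 1.x graph mode; the ops, shapes, and thread count are arbitrary stand-ins I chose, not from my real model). Even with two Python threads calling Session.run concurrently, the kernels appear to serialize, which would be consistent with max_streams == 1:

```python
import threading
import tensorflow as tf

# Hypothetical stand-ins for two independent train ops on one GPU.
with tf.device('/gpu:0'):
    a = tf.Variable(tf.random_normal([4096, 4096]))
    b = tf.Variable(tf.random_normal([4096, 4096]))
    op1 = tf.matmul(a, a)
    op2 = tf.matmul(b, b)

sess = tf.Session()
sess.run(tf.global_variables_initializer())

def worker(op):
    # Session.run is thread-safe, so both threads enter the runtime at once.
    for _ in range(10):
        sess.run(op)

threads = [threading.Thread(target=worker, args=(op,)) for op in (op1, op2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# With max_streams hard-coded to 1, both ops' kernels go onto the same
# per-GPU compute stream, so they execute back-to-back on the device.
```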
======================================
I am wondering whether TensorFlow (1.x) supports multi-threaded or multi-stream execution on a single GPU. If not, I am curious about the underlying reasons: did TF make this choice on purpose, does some library such as CUDA prevent TF from providing it, or is there some other reason?
Like some previous posts [1, 2], I tried to run multiple training ops in TF, i.e. sess.run([train_op1, train_op2], feed_dict={...}), and I used the TF timeline to profile each iteration. However, the TF timeline always showed that the two train ops ran sequentially (although the timeline is not perfectly accurate [3], the wall time of each op suggests sequential execution). I also looked at some of the TF source code; it looks like each op is computed in device->ComputeAsync() or device->Compute(), and the GPU is blocked while computing an op. If I am correct, one GPU can only run a single op at a time, which may lower GPU utilization.
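To make the experiment concrete, here is a minimal sketch of the profiling setup, with two independent matmuls standing in for the real train ops (the shapes and output file name are my own choices, not from the original runs):

```python
import tensorflow as tf
from tensorflow.python.client import timeline

# Hypothetical stand-ins for train_op1/train_op2: two independent matmuls.
with tf.device('/gpu:0'):
    a = tf.Variable(tf.random_normal([2048, 2048]))
    b = tf.Variable(tf.random_normal([2048, 2048]))
    op1 = tf.matmul(a, a)
    op2 = tf.matmul(b, b)

config = tf.ConfigProto(allow_soft_placement=True)
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run([op1, op2], options=run_options, run_metadata=run_metadata)
    # Dump a Chrome trace; load timeline.json in chrome://tracing to
    # inspect each op's start time and duration.
    tl = timeline.Timeline(run_metadata.step_stats)
    with open('timeline.json', 'w') as f:
        f.write(tl.generate_chrome_trace_format())
```

Inspecting the resulting trace in chrome://tracing is where I see the ops' wall times lined up one after another rather than overlapping.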
1. Running multiple tensorflow sessions concurrently
2. Run parallel op with different inputs and same placeholder
3. https://github.com/tensorflow/tensorflow/issues/1824#issuecomment-244251867