
I have read the documentation carefully, but I am still confused by the large amount of information about different CUDA versions.

Is there only one default stream for the entire device, or is there one default stream per process on the host CPU? If the answer depends on the CUDA version, could you also list the situation for different CUDA versions?

Nyaruko

1 Answer


By default, CUDA has a per-process default stream. There is a compiler flag, `--default-stream per-thread`, which changes this to a per-host-thread default stream; see the documentation.
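
For illustration, here is a minimal sketch (my own, not from the documentation) whose behaviour changes with that flag: when built with `nvcc --default-stream per-thread`, each host thread launches its kernel into its own default stream and the two kernels may overlap, whereas with the legacy default both launches serialize on the single NULL stream.

```cuda
// spin.cu -- compare: nvcc spin.cu                              (legacy default stream)
//            nvcc --default-stream per-thread spin.cu           (per-thread default stream)
#include <cstdio>
#include <thread>
#include <cuda_runtime.h>

// Dummy kernel that just burns time so that any overlap is visible in a profiler.
__global__ void spin(int n)
{
    for (volatile int i = 0; i < n; ++i) { }
}

void worker()
{
    // No stream argument: the kernel is launched into this host thread's default stream.
    spin<<<1, 1>>>(1 << 22);
}

int main()
{
    std::thread t1(worker), t2(worker);
    t1.join();
    t2.join();
    cudaDeviceSynchronize();
    printf("last error: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}
```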

Note that streams and host threads are programming-level abstractions over hardware details. Even within a single process, the number of streams that can actually execute concurrently is limited by the hardware. For example, on the Fermi architecture all streams were multiplexed into a single hardware queue, but since Kepler there are 32 separate hardware queues (see CUDA Streams: Best Practices and Common Pitfalls).
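
As a small sketch of explicit (non-default) streams, the following launches a dummy `busy` kernel into each of several streams; on Kepler and newer, such streams can map onto separate hardware queues, so the kernels may run concurrently if resources allow (visible in a profiler timeline).

```cuda
#include <cuda_runtime.h>

// Dummy kernel used only to occupy the GPU for a while.
__global__ void busy()
{
    for (volatile int i = 0; i < (1 << 22); ++i) { }
}

int main()
{
    const int nStreams = 4;
    cudaStream_t streams[nStreams];
    for (int i = 0; i < nStreams; ++i)
        cudaStreamCreate(&streams[i]);

    // One kernel per stream; whether they actually overlap depends on the hardware.
    for (int i = 0; i < nStreams; ++i)
        busy<<<1, 1, 0, streams[i]>>>();

    cudaDeviceSynchronize();

    for (int i = 0; i < nStreams; ++i)
        cudaStreamDestroy(streams[i]);
    return 0;
}
```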

Since the programming guide does not talk about multiple processes in this part, I believe these abstractions do not define the behaviour of multi-process scenarios. For multi-process use, the relevant term is "CUDA context", which is created for each process and even for each host thread (when using the runtime API). As for how many contexts can be active on a device at the same time: the guide says in 3.4 Compute modes that in the default mode, "Multiple host threads can use the device". Since the following exclusive-process mode talks about CUDA contexts instead, I assume that the description of the default mode covers multiple host threads from multiple processes as well.
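
The compute mode of a device can be queried from the runtime API as a device attribute; a minimal sketch, assuming device 0:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int mode = 0;
    cudaDeviceGetAttribute(&mode, cudaDevAttrComputeMode, 0);
    // cudaComputeModeDefault: multiple host threads/processes may use the device;
    // cudaComputeModeExclusiveProcess: only one process may create a context on it.
    printf("compute mode: %d (default=%d, exclusive-process=%d, prohibited=%d)\n",
           mode, cudaComputeModeDefault, cudaComputeModeExclusiveProcess,
           cudaComputeModeProhibited);
    return 0;
}
```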

For more info about multi-process concurrency see e.g. How do I use Nvidia Multi-process Service (MPS) to run multiple non-MPI CUDA applications?, Unleash legacy MPI codes with Kepler's Hyper-Q and CUDA Streams: Best Practices and Common Pitfalls.

Finally, note that multi-process concurrency has worked this way since the Kepler architecture, which is the oldest architecture still supported nowadays. Since the Pascal architecture there is also support for compute preemption (see 3.4 Compute modes for details).
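
Compute preemption support can likewise be queried as a device attribute (available since CUDA 8); again a minimal sketch for device 0:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int preemption = 0;
    cudaDeviceGetAttribute(&preemption, cudaDevAttrComputePreemptionSupported, 0);
    printf("compute preemption supported: %d\n", preemption);  // 1 on Pascal and newer
    return 0;
}
```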

Jakub Klinkovský
    `--default-stream per-thread` was introduced with v7.0. See the [CUDA 7.0 Release Notes](http://developer.download.nvidia.com/compute/cuda/7_0/Prod/doc/CUDA_Toolkit_Release_Notes.pdf), chapter 2.1, first bullet point. – BlameTheBits Aug 26 '19 at 08:41
  • However, I have found the following in the CUDA 9.0 documentation: "For code that is compiled using the `--default-stream legacy` compilation flag, the default stream is a special stream called the NULL stream and each device has a single NULL stream used for all host threads. The NULL stream is special as it causes implicit synchronization as described in Implicit Synchronization. For code that is compiled without specifying a `--default-stream` compilation flag, `--default-stream legacy` is assumed as the default." – Nyaruko Aug 26 '19 at 08:44
  • The legacy default stream is per device. This is readily provable with a trivial code in a multi-device system. This statement in your answer "Note that it is not even per-device" is incorrect. – Robert Crovella Aug 26 '19 at 13:46