Independent Thread Scheduling since Volta

Question

NVIDIA introduced Independent Thread Scheduling for its GPGPUs with Volta. When CUDA threads diverge, the alternative code paths are no longer executed as whole blocks but can be interleaved instruction by instruction. Still, divergent paths cannot be executed at the same time, since the GPUs remain SIMT. This is the original article:

https://developer.nvidia.com/blog/inside-volta/ (scroll down to "Independent Thread Scheduling").

I understand what this means. What I don't understand is how this new behavior accelerates code. Even the before/after diagrams in the article above do not show an overall speed-up.

My question: Which kinds of divergent algorithms will run faster on Volta (and newer) due to the described new scheduling?

One small speedup advantage is that the scheduler can choose the divergent branch for which resources are free at that time. — Sebastian, Feb 04 '22 at 21:36

score 8 · Accepted Answer · answered Feb 04 '22 at 14:37

The purpose of the feature is not necessarily to accelerate code.

An important purpose of the feature is to enable reliable use of programming models such as producer-consumer within a warp (among threads in the same warp) that would have been either brittle or prone to hang under the pre-Volta thread schedulers.
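To make the producer-consumer case concrete, here is a minimal sketch, not taken from the linked article. The kernel name `warp_producer_consumer`, the helper `run_producer_consumer`, and the buffer layout are hypothetical. Lane 0 produces a value and raises a flag while the other 31 lanes of the same warp spin on that flag. On a pre-Volta GPU, if the spinning branch happens to be scheduled first, lane 0 may never run and the kernel may hang; on Volta and newer (compiled for example with `nvcc -arch=sm_70`) the spinning lanes should not starve the producer.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: lane 0 produces, lanes 1..31 of the same warp consume.
__global__ void warp_producer_consumer(volatile int *flag, volatile int *data, int *out)
{
    const int lane = threadIdx.x;      // launched with exactly one warp
    if (lane == 0) {
        *data = 42;                    // produce the payload
        __threadfence();               // make the payload visible before the flag
        *flag = 1;                     // signal the consumers
    } else {
        while (*flag == 0) { }         // consumers spin until the producer signals
        __threadfence();               // order the flag read before the payload read
        out[lane] = *data;
    }
}

// Hypothetical host helper: returns how many consumer lanes observed the payload.
int run_producer_consumer()
{
    int *d_buf = nullptr;              // layout: [flag, data, out[0..31]]
    cudaMalloc(&d_buf, 34 * sizeof(int));
    cudaMemset(d_buf, 0, 34 * sizeof(int));
    warp_producer_consumer<<<1, 32>>>(d_buf, d_buf + 1, d_buf + 2);
    cudaDeviceSynchronize();
    int h_out[32];
    cudaMemcpy(h_out, d_buf + 2, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_buf);
    int seen = 0;
    for (int i = 1; i < 32; ++i) seen += (h_out[i] == 42);
    return seen;
}

int main()
{
    printf("consumers that saw the payload: %d of 31\n", run_producer_consumer());
    return 0;
}
```

Note that nothing here runs faster because of the new scheduler; the point is only that the kernel is able to finish.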

The typical example, IMO (you can find various instances here on the cuda tag), is people trying to negotiate for atomic locks among threads in the same warp. This would have been "brittle" (and here) or unworkable (hangs) on previous architectures. It works well on Volta, in my experience.
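A minimal sketch of that lock pattern follows. It is an illustration under my own assumptions, not code from the linked questions; `lock`, `increment_under_lock`, and `run_lock_demo` are hypothetical names. Every thread, including the 32 threads of a single warp, spins on one global lock via `atomicCAS`. On pre-Volta hardware the lane that wins the lock can be held back while its warp-mates keep spinning, so the kernel may hang; built for Volta or newer (for example `nvcc -arch=sm_70`) it should complete.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__device__ int lock = 0;               // 0 = free, 1 = held (hypothetical global lock)

// Hypothetical kernel: each thread takes the lock, bumps the counter, releases the lock.
__global__ void increment_under_lock(volatile int *counter)
{
    while (atomicCAS(&lock, 0, 1) != 0) {
        // spin until this thread acquires the lock
    }
    __threadfence();                   // see the previous holder's update
    *counter = *counter + 1;           // critical section
    __threadfence();                   // publish the update before releasing
    atomicExch(&lock, 0);              // release the lock
}

// Hypothetical host helper: launches the kernel and returns the final counter.
int run_lock_demo(int blocks, int threads)
{
    int *d_counter = nullptr;
    cudaMalloc(&d_counter, sizeof(int));
    cudaMemset(d_counter, 0, sizeof(int));
    increment_under_lock<<<blocks, threads>>>(d_counter);
    cudaDeviceSynchronize();
    int h_counter = 0;
    cudaMemcpy(&h_counter, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_counter);
    return h_counter;
}

int main()
{
    // If every thread acquired the lock exactly once, the counter equals the thread count.
    printf("counter = %d (threads launched: 32)\n", run_lock_demo(1, 32));
    return 0;
}
```

The critical section is fully serialized, so this is a correctness demonstration rather than a fast way to count.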

Here is another example of an algorithm that just hangs on pre-Volta, but "works" (does not hang) on Volta+.

