I am reading Professional CUDA C Programming, and in GPU Architecture Overview section:

CUDA employs a Single Instruction Multiple Thread (SIMT) architecture to manage and execute threads in groups of 32 called warps. All threads in a warp execute the same instruction at the same time. Each thread has its own instruction address counter and register state, and carries out the current instruction on its own data. Each SM partitions the thread blocks assigned to it into 32-thread warps that it then schedules for execution on available hardware resources.

The SIMT architecture is similar to the SIMD (Single Instruction, Multiple Data) architecture. Both SIMD and SIMT implement parallelism by broadcasting the same instruction to multiple execution units. A key difference is that SIMD requires that all vector elements in a vector execute together in a unified synchronous group, whereas SIMT allows multiple threads in the same warp to execute independently. Even though all threads in a warp start together at the same program address, it is possible for individual threads to have different behavior. SIMT enables you to write thread-level parallel code for independent, scalar threads, as well as data-parallel code for coordinated threads. The SIMT model includes three key features that SIMD does not:
➤ Each thread has its own instruction address counter.
➤ Each thread has its own register state.
➤ Each thread can have an independent execution path.

The first paragraph says "All threads in a warp execute the same instruction at the same time.", while the second paragraph says "Even though all threads in a warp start together at the same program address, it is possible for individual threads to have different behavior." These two statements seem contradictory, and they confuse me. Could anyone explain how both can be true?
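
For example (my own toy kernel, not from the book), threads in the same 32-thread warp take different branches here, so I don't see how they can all be executing the same instruction at the same time:

    // Toy kernel: even and odd lanes of the same warp take different paths.
    __global__ void divergent(int *out)
    {
        int tid = threadIdx.x;
        if (tid % 2 == 0)
            out[tid] = tid * 2;    // even lanes follow this path
        else
            out[tid] = tid + 100;  // odd lanes follow this path
    }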

asked by Nan Xiao

1 Answer


There is no contradiction. All threads in a warp execute the same instruction in lock-step at all times. To support conditional execution and branching, CUDA introduces two concepts in the SIMT model:

  1. Predicated execution (See here)
  2. Instruction replay/serialisation (See here)

Predicated execution means that the result of a conditional test can be used to mask off threads from executing a subsequent instruction, without a branch. Instruction replay is how a classic conditional branch is handled: all threads in the warp execute all branches of the conditionally executed code by replaying instructions, and threads which do not follow a particular execution path are masked off and execute the equivalent of a NOP. This is the so-called branch divergence penalty in CUDA, and it can have a significant impact on performance.
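
As a rough sketch of both mechanisms (my example; whether a given conditional is predicated or compiled as a real branch is a compiler decision, so the comments describe the typical outcome, not a guarantee):

    __global__ void branchy(const int *in, int *out)
    {
        int tid = threadIdx.x;
        int v = in[tid];

        // Short conditional: typically compiled to a predicated instruction.
        // Every thread issues it in lock-step, but threads whose predicate
        // is false do not write the result.
        if (v < 0) v = 0;

        // Larger branch: the warp serialises. The threads taking the if-body
        // run first while the others are masked off (effectively NOPs), then
        // the roles reverse for the else-body. The cost is roughly the sum
        // of both paths -- the branch divergence penalty.
        if (tid & 1)
            out[tid] = v * v;
        else
            out[tid] = v + 1;
    }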

This is how lock-step execution can support branching.
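
A standard way to reduce the penalty (my addition, not something the question asked about) is to arrange conditions so that whole warps take the same path:

    // Branch granularity aligned to the warp size: every thread in a warp
    // computes the same value of (warpId & 1), so the warp never diverges.
    __global__ void nondivergent(int *out)
    {
        int tid = threadIdx.x;
        int warpId = tid / warpSize;   // warpSize is the built-in variable (32 on current GPUs)
        if (warpId & 1)
            out[tid] = tid * 2;
        else
            out[tid] = tid + 100;
    }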

answered by talonmies
  • May I know if there is any significant difference between threads that are masked off in these two cases? The programming guide says "Instructions with a false predicate do not write results, and also do not evaluate addresses or read operands". Is this very different from NOP? Also, for the 2nd case I'm not sure what instructions get replayed at which stage. Thanks. – biubiuty Dec 28 '16 at 01:57
  • If "all threads in a warp execute the same instruction at the same time", why "each thread has its own instruction address counter"? Could all threads in 1 warp share the 1 instruction address counter? – Thomson Sep 21 '19 at 21:37
  • @Thomson: Because predication and instruction replay require thread-level state to work. Threads do share a warp-level instruction counter, but each thread also requires its own state relative to that warp-level counter. – talonmies Sep 24 '19 at 05:38