The statement is true of the Tesla architecture, but it is incorrect for Fermi and Kepler. It is easier to reason about the SM in terms of warp schedulers. On each cycle, each warp scheduler selects an eligible warp (a warp that is not stalled) and dispatches one or two instructions from that warp, depending on the compute capability, to the execution units. The number of execution units per SM is documented in the Fermi and Kepler whitepapers. The CUDA core count roughly equals the number of execution units that can perform integer and single-precision floating-point operations. There are additional execution units for load/store operations, branching, etc.
Compute Capability 1.x (Tesla)
- 1 warp scheduler per SM
- Dispatch 1 instruction per warp scheduler per cycle
Compute Capability 2.0 (Fermi 1st Generation)
- 2 warp schedulers per SM
- Dispatch 1 instruction per warp scheduler per cycle
Compute Capability 2.1 (Fermi 2nd Generation)
- 2 warp schedulers per SM
- Dispatch 1 or 2 instructions per warp scheduler per cycle
Compute Capability 3.x (Kepler)
- 4 warp schedulers per SM
- Dispatch 1 or 2 instructions per warp scheduler per cycle
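
To make the list above easier to compare, here is a minimal host-side C++17 sketch that encodes those figures and multiplies them out to get an upper bound on instructions dispatched per SM per cycle. The names `SchedulerInfo`, `scheduler_info`, and `peak_dispatch_per_cycle` are hypothetical helpers invented for illustration and are not part of the CUDA API. The result is only a ceiling on dispatch; actual throughput depends on how many warps are eligible and which execution units are free.

```cpp
#include <optional>

// Per-SM scheduler figures, copied from the list above.
struct SchedulerInfo {
    int warp_schedulers;            // warp schedulers per SM
    int max_dispatch_per_scheduler; // max instructions dispatched per scheduler per cycle
};

// Hypothetical helper: maps a compute capability (major.minor) to the
// figures listed above. Returns std::nullopt for versions the list
// does not cover.
std::optional<SchedulerInfo> scheduler_info(int major, int minor) {
    if (major == 1)               return SchedulerInfo{1, 1}; // Tesla
    if (major == 2 && minor == 0) return SchedulerInfo{2, 1}; // Fermi 1st generation
    if (major == 2 && minor == 1) return SchedulerInfo{2, 2}; // Fermi 2nd generation
    if (major == 3)               return SchedulerInfo{4, 2}; // Kepler
    return std::nullopt;
}

// Upper bound on instructions dispatched per SM per cycle:
// every scheduler dispatches its maximum in the same cycle.
int peak_dispatch_per_cycle(const SchedulerInfo& info) {
    return info.warp_schedulers * info.max_dispatch_per_scheduler;
}
```

For example, `peak_dispatch_per_cycle(*scheduler_info(3, 5))` evaluates to 8 for Kepler (4 schedulers, up to 2 instructions each), while `scheduler_info(1, 3)` gives a ceiling of 1 for Tesla.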