The number of CUDA cores is the number of single-precision floating-point units in the SM. The SM also has other execution units, including special function units (RSQRT, COS, SIN, ...), double-precision units, load/store units, texture units, a branch unit, etc.
The Fermi, Kepler-gk10x, Kepler-gk110 and Maxwell whitepapers contain additional information on the type and number of execution units in the SMs.
The instruction throughput of Arithmetic Instructions can be found in the CUDA Programming Guide in the Table of Throughput of Arithmetic Instructions.
As a developer, you want to understand the rate at which an SM can issue instructions, which is documented in the throughput table. The rate is determined by the throughput of the warp schedulers as well as the throughput of the execution units (again, not just the CUDA cores).
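To make the "schedulers as well as execution units" point concrete, here is a rough sketch of the arithmetic. The function names are made up for illustration, and the lane counts (192 single-precision lanes and 32 special-function lanes per SM on a CC3.0 part) are my reading of the throughput table, so check them against the guide for your device. The model simply takes the smaller of the two limits and ignores everything else that can stall a warp.

```python
WARP_SIZE = 32  # threads per warp

def scheduler_limit(schedulers, issue_width):
    """Warp instructions per cycle the warp schedulers can issue."""
    return schedulers * issue_width

def unit_limit(lanes_per_cycle):
    """Warp instructions per cycle one type of execution unit can accept."""
    return lanes_per_cycle / WARP_SIZE

def sustained_rate(schedulers, issue_width, lanes_per_cycle):
    """Simplified model: the slower of the two limits wins."""
    return min(scheduler_limit(schedulers, issue_width),
               unit_limit(lanes_per_cycle))

# Assumed CC3.0 figures: 4 schedulers, up to 2 instructions each per cycle.
fp32_rate = sustained_rate(4, 2, 192)  # limited by the FP32 units
sfu_rate = sustained_rate(4, 2, 32)    # limited by the special function units

print("FP32 warp instructions per cycle:", fp32_rate)
print("SFU warp instructions per cycle:", sfu_rate)
```

With these assumed numbers the schedulers could issue 8 warp instructions per cycle, but a stream of pure single-precision math is capped at 6 and a stream of pure special-function calls at 1, which is why the CUDA core count alone does not tell you the issue rate.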
CC1.x Tesla
- 1 warp scheduler per SM
- Each warp scheduler selects 1 eligible warp and issues 1 instruction per 4 cycles.
CC2.x Fermi
- 2 warp schedulers per SM
- CC2.0: Each warp scheduler selects 1 eligible warp per tepid clock and issues 1 instruction.
- CC2.1: Each warp scheduler selects 1 eligible warp per tepid clock and issues up to 2 independent instructions.
- The math pipes run at hot clock (2x tepid clock). This often results in people stating that instructions are issued over 2 clock cycles. It is easier to think in terms of tepid clock.
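The hot clock versus tepid clock confusion is just a unit conversion. As a small sketch, assuming a Fermi math pipe is 16 lanes wide (the whitepaper describes the 32 cores of an SM as two groups of 16), a 32-thread warp needs 2 hot clocks on one pipe, which is 1 tepid clock. The helper names are hypothetical.

```python
import math

WARP_SIZE = 32
HOT_CLOCKS_PER_TEPID_CLOCK = 2  # math pipes run at 2x the scheduler clock

def hot_clocks_per_warp(pipe_lanes):
    """Hot clocks needed to push one full warp through a pipe."""
    return math.ceil(WARP_SIZE / pipe_lanes)

def to_tepid_clocks(hot_clocks):
    """Convert a hot-clock count to tepid clocks."""
    return hot_clocks / HOT_CLOCKS_PER_TEPID_CLOCK

hot = hot_clocks_per_warp(16)   # assumed 16-lane pipe
tepid = to_tepid_clocks(hot)

print("hot clocks per warp instruction:", hot)
print("tepid clocks per warp instruction:", tepid)
```

Both statements describe the same hardware: "2 cycles per instruction" counts hot clocks, while "1 instruction per cycle per scheduler" counts tepid clocks.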
CC3.x Kepler and CC5.0 Maxwell
- 4 warp schedulers per SM
- Each warp scheduler selects 1 eligible warp and issues up to 2 independent instructions.
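The scheduler figures above can be turned into a peak issue rate per SM with one multiplication. This is only a sketch of that arithmetic: the table and function names are mine, "cycle" means whichever clock the corresponding line above uses (tepid clock for Fermi), and the result is an upper bound that assumes every scheduler always finds an eligible warp and, where dual issue applies, two independent instructions.

```python
# name: (warp schedulers per SM, instructions per issue, cycles per issue)
ISSUE_MODEL = {
    "CC1.x Tesla": (1, 1, 4),
    "CC2.0 Fermi": (2, 1, 1),
    "CC2.x Fermi, dual issue": (2, 2, 1),
    "CC3.x Kepler": (4, 2, 1),
    "CC5.0 Maxwell": (4, 2, 1),
}

def peak_issue_per_cycle(schedulers, per_issue, cycles):
    """Upper bound on warp instructions issued per cycle per SM."""
    return schedulers * per_issue / cycles

def peak_table():
    """Peak issue rate for every entry in ISSUE_MODEL."""
    return {name: peak_issue_per_cycle(*params)
            for name, params in ISSUE_MODEL.items()}

for name, rate in peak_table().items():
    print(f"{name}: up to {rate} warp instructions per cycle per SM")
```

Real kernels rarely reach these bounds, because the execution units, memory latency and instruction dependencies all limit how often a scheduler can actually issue.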