
In the Kepler architecture whitepaper, NVIDIA states that there are 32 Special Function Units (SFUs) and 32 Load/Store Units (LD/ST) on an SMX.

The SFUs are for "fast approximate transcendental operations". Unfortunately, I don't understand what this is supposed to mean. On the other hand, at Special CUDA Double Precision trig functions for SFU it is said that they only work in single precision. Is this still correct on a K20Xm?

The LD/ST units are obviously for storing and loading. Is every memory load/write required to go through one of these? And are they also used by a single warp? In other words, can there be only one warp that is currently writing or reading?

Cheers, Andi

user2267896
  • How do NVIDIA GPUs handle double-precision transcendental functions? That's an interesting question and one I hadn't considered before. I hope someone can answer. If not, I think we can make a pretty good guess after measuring throughput and the number of valid bits in the results. – Roger Dahl Dec 09 '13 at 16:04
  • @RogerDahl SFUs work on single precision only, as remarked below. Their hardware implementation is based on quadratic interpolation in ROM tables using fixed-point arithmetic, as described in the paper _Stuart F. Oberman and Michael Siu. A high-performance area-efficient multifunction interpolator. In Proceedings of the 17th IEEE Symposium on Computer Arithmetic (Cape Cod, USA), pages 272–279, July 2005._ This is an answer that njuffa gave me some time ago on the NVIDIA forum, see [Fermi and Kepler GPU Special Function Units](https://devtalk.nvidia.com/default/topic/531855/?comment=3746296). – Vitality Dec 09 '13 at 21:44

2 Answers


The SFU are for "fast approximate transcendental operations"

SFUs compute functions like __cosf(), __expf() etc.
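
To make the distinction concrete, here is a minimal sketch (kernel and array names are illustrative, not from the original post) contrasting the SFU intrinsics with the accurate single-precision math functions:

```cuda
// Hypothetical example: SFU fast-math intrinsics vs. accurate library calls.
__global__ void sfu_demo(const float *in, float *fast_out,
                         float *accurate_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Mapped to SFU instructions: low precision, high throughput.
        fast_out[i] = __cosf(in[i]) + __expf(in[i]);
        // Computed by a software sequence on the regular cores:
        // IEEE-accurate, but slower. (Compiling with -use_fast_math
        // replaces these calls with the intrinsics above.)
        accurate_out[i] = cosf(in[i]) + expf(in[i]);
    }
}
```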

On the other hand, here it is said that they only work in single precision. Is this still correct on a K20Xm?

According to the recent CUDA C Programming Guide, section G.5.1, they still work only in single precision.

It makes some sense: if you need double precision, it is unlikely you would use inaccurate math functions. You can refer to this answer for suggestions on double-precision arithmetic optimizations.

The implementation details of double-precision operations can be found in /usr/local/cuda-5.5/include/math_functions_dbl_ptx3.h (or wherever your CUDA Toolkit is installed). E.g., for sin and cos it uses Payne-Hanek argument reduction followed by a Taylor expansion (up to order 14).

For double-precision calculations, SFUs seem to be used only in __internal_fast_rcp and __internal_fast_rsqrt, which in turn are used in acos, log, cosh, and several other functions (see math_functions_dbl_ptx3.h). So most of the time they stall, just as the LD/ST units stall when there are no ongoing memory transactions.

Is any memory load/write required to go through one of these?

Yes, every access to global memory goes through an LD/ST unit.

And are they also used by a single warp? In other words, can there be only one warp that is currently writing or reading?

The number of units constrains only the number of instructions issued each cycle: every clock cycle, 32 load instructions can be issued and 32 results can be returned.

One instruction can read/write up to 128 bytes, so if each thread in a warp reads 4 bytes and the accesses are coalesced, the whole warp requires a single load/store instruction. If the accesses are uncoalesced, more instructions must be issued.

Moreover, the units are pipelined, meaning multiple load/store requests can be in flight concurrently in a single unit.
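
As a sketch of the coalescing point above (kernel names and the stride parameter are illustrative, not from the original post):

```cuda
__global__ void coalesced_copy(const float *in, float *out)
{
    // Adjacent threads touch adjacent 4-byte words, so the warp's 32
    // accesses fall into one 128-byte segment: a single load instruction
    // (and a single store instruction) serves the whole warp.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

__global__ void strided_copy(const float *in, float *out, int stride)
{
    // With a large stride the warp's accesses scatter across many
    // 128-byte segments, so each load/store is split into multiple
    // memory transactions, wasting bandwidth.
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];
}
```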

aland
  • Okay, thanks. So if I understood correctly: when I don't have any single-precision transcendental operations, those units will just stay idle? Is there any other way I could utilize them? – user2267896 Dec 09 '13 at 21:05
  • The SFUs only support six (single-precision) operations: sin/cos, exp/log, rcp/rsqrt. They can be used for transcendentals where the single-precision instruction gives a good approximation for refinement (e.g. reciprocal, reciprocal square root). But for transcendentals like sine and cosine, where the single-precision approximation doesn't help, you won't see any SFU instructions in the generated microcode. The double-precision implementation of the math library should automatically make use of the SFUs where they will benefit. – ArchaeaSoftware Dec 10 '13 at 01:30
  • @ArchaeaSoftware: Do you know how the DP transcendentals are implemented and what type of performance and precision they have? – Roger Dahl Dec 10 '13 at 18:51
  • 1
    Roger, I don't know that there's a good answer to your question. I have a passing familiarity with the mathlib from working shoulder-to-shoulder with the engineer who owned it through CUDA 3.x or so. The precision is documented in Table 7 of the CUDA programming guide (http://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf). Both the precision and the performance tend to improve over time, since the CUDA team is constantly working to improve them. – ArchaeaSoftware Dec 10 '13 at 21:32

Don't accept this as an answer -- we're hoping that someone will come along and answer your question about double precision transcendental operations. I just wanted to address the second part of your question, about the LD/ST units.

The LD/ST units are obviously for storing and loading.

Yes.

Is any memory load/write required to go through one of these?

Yes.

And are they also used as a single warp?

Yes, all active threads in a warp always issue the same type of instruction in the same clock cycle. If that instruction is a load or store, it gets issued to the LD/ST units. If a thread is inactive (due to looping or conditional execution), the corresponding LD/ST unit stays idle.

In other words, can there be only one warp that is currently writing or reading?

No, the LD/ST units can accept one load or store operation per clock, even though memory latency can be several hundred cycles. So, when one warp issues a load instruction, the LD/ST units will start working on retrieving that data. Instructions in the warp that depend on the data become ineligible to be issued until the data arrives. In the next clock cycle, the warp may still execute other independent instructions (instruction-level parallelism). Even other, independent load or store instructions. Another warp that is eligible to be scheduled may also, in the next clock cycle, issue another load instruction and itself go into a waiting state (thread-level parallelism). At that point, the LD/ST units are keeping track of two pending results. Due to caching and coalescing, it is possible that the data for the second warp arrives first. When data for a warp arrives it gets assigned to the registers designated in the instruction and that particular data dependency is then resolved.
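
The instruction-level-parallelism point can be sketched like this (a hypothetical kernel, not from the original post):

```cuda
__global__ void ilp_loads(const float *a, const float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Two independent loads: both can be issued back-to-back and be
        // in flight at the same time. The warp only becomes ineligible
        // at the add below, where the loaded values are actually consumed.
        float x = a[i];
        float y = b[i];
        out[i] = x + y;
    }
}
```

While this warp waits for x and y, the scheduler is free to issue instructions (including further loads) from other warps, which is exactly the thread-level parallelism described above.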

Roger Dahl
  • Hey, thanks. So basically only one set of 32 load/store operations can be issued per cycle, but there can be many of those in flight (see aland's answer)? In other words: for optimal usage we need to issue enough load/store operations to keep those units busy, without exceeding the actual throughput (hardware limit)? – user2267896 Dec 09 '13 at 21:09