Denormals are known to underperform severely, 100x or so, compared to normals. This frequently causes unexpected software problems.

I'm curious, from a CPU architecture viewpoint, why denormals have to be that much slower. Is the lack of performance intrinsic to their unfortunate representation? Or do CPU architects neglect them to reduce hardware cost, under the (mistaken) assumption that denormals don't matter?

In the former case, if denormals are intrinsically hardware-unfriendly, are there known non-IEEE-754 floating point representations that are also gapless near zero, but more convenient for hardware implementation?
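To make the magnitude concrete, here is a minimal timing sketch (my own illustration; the loop count, constants, and exact ratio are arbitrary and will vary by CPU and compiler). It runs the same multiply loop once on a normal starting value and once on a subnormal one:

```c
#include <stdio.h>
#include <time.h>

/* Time a multiply loop; 'volatile' keeps the compiler from
 * folding or vectorizing the loop away. */
static double time_loop(double x) {
    volatile double acc = x;
    clock_t start = clock();
    for (int i = 0; i < 10000000; i++)
        acc = acc * 0.999999;   /* value stays in its starting range */
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}

int main(void) {
    /* 1e-310 is below DBL_MIN (~2.2e-308), so it is subnormal. */
    printf("normal:    %.3f s\n", time_loop(1.0));
    printf("subnormal: %.3f s\n", time_loop(1e-310));
    return 0;
}
```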

Michael
  • Denormals require additional normalization (on input) and denormalization (on output) steps. Some processor architectures expend additional hardware (shifters plus control logic) to handle this at full speed, e.g. NVIDIA GPUs; other processor architectures handle this by internal exception handling via microcode, e.g. most x86 CPUs, which saves hardware but is much slower. The justification for the latter approach is to handle frequent cases fast, and infrequent cases (e.g. infinities, NaNs, denormals) correctly with minimal hardware expenditure. – njuffa Apr 22 '16 at 00:08
  • @njuffa: why not make that an answer? – Simon Byrne Apr 22 '16 at 10:30
  • @SimonByrne I would, if I had the time to collect the necessary references. I am hesitant to spend the time finding references, as the question appears off-topic here (it is not really about programming, but about hardware architecture) and thus may disappear at any time. – njuffa Apr 22 '16 at 15:06
  • @njuffa: I would appreciate a bit more info on that. For example, I don't understand the need for additional normalization: when you manually add numbers of different lengths, such as 123456e7 and 890, you don't need to normalize the short one without an exponent (890), you only need to normalize the one with an exponent (1234560000000). All denormals have the same exponent, so to speak, and a bunch of leading zeros; normals have variable exponents. I would expect more trouble implementing normals than denormals. – Michael Apr 22 '16 at 17:18
  • E.M. Schwarz, M. Schmookler, and S.D. Trong, "Hardware implementations of denormalized numbers". In: *Proceedings 16th IEEE Symposium on Computer Arithmetic*, June 15, 2003, pp. 70-78. ([online version](http://www.dec.usc.es/arith16/papers/paper-149.ps)) – njuffa Apr 22 '16 at 18:03
  • @njuffa: IIRC, Intel SnB-family FPU hardware can handle denormals at full speed without a microcode assist in some limited cases (e.g. adding two denormals to produce a denormal). This is a change from previous hardware that needed microcode assists in every denormal case. ([On Nehalem i7 930, NaNs slow down x87 26x more than SSE](http://stackoverflow.com/questions/31875464/huge-performance-difference-26x-faster-when-compiling-for-32-and-64-bits/31879376#31879376). NaN may not slow down SSE math at all on most CPUs.) – Peter Cordes Apr 23 '16 at 04:39
  • @PeterCordes Note that I said "*most* x86 CPUs". It may even vary within the same processor: the original AMD Athlon processor had hardware support for denormals in the read path (by dynamically lengthening the pipeline; there is a patent on that), but a microcode exception handler for denormals in the store path (to avoid the overhead of denormal support slowing down the store-to-load forwarding path in the common case of no denormals). – njuffa Apr 23 '16 at 07:38
  • Handling of special situations involving zeros, infinities, and NaNs is more often performed by special hardware (e.g. on various AMD x86 processors), and is a slightly different issue from handling denormals, as the hardware overhead can be kept fairly small and involves parallel HW paths that typically don't impact the length of the pipeline, whereas denormal handling typically increases pipeline length. – njuffa Apr 23 '16 at 07:49
  • It's a good question... on the face of it, it seems like denormal inputs could be handled simply by switching the implicit leading bit between 1 and 0 based on whether the exponent is nonzero. Yet anecdotal evidence suggests that denormal handling is slow even if the outputs aren't denormal. – Sneftel Apr 24 '16 at 10:02 (a bit-level sketch of that exponent test follows this list)
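As an illustration of the bit-level test mentioned in the comments above (my own sketch, not from any of the commenters): an IEEE-754 double is subnormal exactly when its 11-bit exponent field is all zeros and its 52-bit fraction field is nonzero; the implicit leading bit is then 0 rather than 1, which is the extra case the hardware has to shift and normalize.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Subnormal double: exponent field == 0 and fraction field != 0.
 * (Exponent 0 with fraction 0 is +/-0, which is not subnormal.) */
static int is_subnormal(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);                /* safe type-punning */
    uint64_t exponent = (bits >> 52) & 0x7FF;      /* 11 exponent bits  */
    uint64_t fraction = bits & 0xFFFFFFFFFFFFFULL; /* 52 fraction bits  */
    return exponent == 0 && fraction != 0;
}

int main(void) {
    printf("%d %d %d\n",
           is_subnormal(1e-310),   /* 1: below DBL_MIN (~2.2e-308) */
           is_subnormal(1.0),      /* 0: normal */
           is_subnormal(0.0));     /* 0: zero   */
    return 0;
}
```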

2 Answers

On most x86 systems, the cause of the slowness is that denormal values trigger an FP_ASSIST, which is very costly because it switches to a microcode flow (very much like a fault).

See, for example, https://software.intel.com/en-us/forums/intel-performance-bottleneck-analyzer/topic/487262

The reason this is the case is probably that the architects decided to optimize the HW for normal values, speculating that each value is normalized (which is far more common), and did not want to risk the performance of the frequent use case for the sake of rare corner cases. This speculation is usually true, so you only pay the penalty when it turns out to be wrong. These trade-offs are very common in CPU design, since any investment in one case usually adds an overhead on the entire system.

In this case, if you were to design a system that tries to optimize all types of irregular FP values, you would have to either add HW to detect and record the state of each value after each operation (which would be multiplied by the number of physical FP registers, execution units, RS entries, and so on, totaling a significant number of transistors and wires), or add some mechanism to check the value on read, which would slow you down when reading any FP value (even the normal ones).

Furthermore, based on the type, you would or would not need to perform some correction; on x86 this is the purpose of the assist code, but if you did not speculate, you would have to perform this flow conditionally on each value, which would already add a large chunk of that overhead to the common path.
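As a practical aside (my addition, not part of the answer above): the usual way to avoid these assists on x86 is to give up IEEE-754 gradual underflow and set the FTZ (flush-to-zero) and DAZ (denormals-are-zero) bits in MXCSR, so SSE arithmetic neither produces nor consumes subnormals. A minimal sketch, assuming an x86 target with SSE3 and a compiler that ships these intrinsics:

```c
#include <xmmintrin.h>  /* _MM_SET_FLUSH_ZERO_MODE (SSE) */
#include <pmmintrin.h>  /* _MM_SET_DENORMALS_ZERO_MODE (SSE3) */

/* Flush subnormal results to zero (FTZ) and treat subnormal inputs
 * as zero (DAZ). Not IEEE-754 compliant, but it sidesteps the
 * microcode assist described in this answer. */
void enable_ftz_daz(void) {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}
```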

Leeor
  • Isn't it really cheap to detect a denormal on read? All-zero exponent and non-zero mantissa? I guess it's still a couple of gate delays to generate that, but I assumed the real cost would be in a hardware implementation of denormal handling. If you had normal and denormal handling hardware, both could start working in parallel, and you'd take the result from whichever one your denormal detector selected once it produced a result. – Peter Cordes Apr 24 '16 at 15:26
  • re: tagging FP values in registers and so on: Agner Fog has found big delays on AMD CPUs when using the output of a `mulps` instruction as input to a `mulpd` instruction, for example. His guess is that AMD CPUs tag the elements of FP vectors with something. (And are slow when buggy code causes a mis-speculation). – Peter Cordes Apr 24 '16 at 15:30
  • @PeterCordes, like I said, the check is simple, but even if done in HW it would still require a condition check or (if we go by your idea) predicated execution; both may have penalties. I guess it was a design decision that we'll never know (without seeing the actual micro-code involved). – Leeor Apr 24 '16 at 18:54
  • Yet it's a design decision that's consistently made; all architectures I know of have huge performance penalties for denormals. Seems like there should be some architecture-independent reason for that. – Sneftel Apr 26 '16 at 08:31
  • Update on my previous comments: handling subnormals as part of a pipelined FP execution unit doesn't hurt throughput, but can hurt *latency*. CPUs tend to care about latency, since dependency chains can be bottlenecks. GPUs are designed for heavily parallel problems and just make the pipeline longer to handle subnormals. (Some modern x86 CPUs handle some cases of subnormals without an assist, e.g. SnB-family for addition/subtraction.) See also njuffa's answer on the later duplicate [Why are denormal floating-point values slower to handle?](https://stackoverflow.com/q/54937154) – Peter Cordes Jul 06 '22 at 01:59
---

Denormals are not handled by the FPU (in hardware) on many architectures, so that leaves the implementation to software.

There's a good basic intro here, under "Performance issues": https://en.wikipedia.org/wiki/Denormal_number

Alex Novickis
  • What are some examples of such architectures? It sounds like x86 is not among them, for instance. – Nate Eldredge Apr 22 '16 at 01:28
  • Most modern architectures handle denormals in hardware, including x86. Early RISC chips tended not to support it, but recent versions of ARM certainly do. – Simon Byrne Apr 22 '16 at 12:10
  • What many architectures do is handle subnormals in microcode. That is faster and simpler than handling them in software. – Pascal Cuoq Apr 22 '16 at 12:35
  • Alex, I know that; my question is not **whether** that is so, it's **why** that is so. What possessed CPU architects to implement only normal numbers in hardware and risk a tremendous slow-down, such as a 10x slowdown of certain signal-processing software? **Why** is it so difficult to implement both normals and denormals in hardware? – Michael Apr 22 '16 at 17:13
  • To meet the IEEE spec, your "system" has to comply, not necessarily the hardware; you can have software assist, and in fact some of the results change depending on whether exceptions are enabled. x86 is famous for its floating-point bugs and software assists, well after the original Pentium bug. The spec was intentionally written to be very hard to comply with, and hardware vendors will still have bugs. Since you likely need software support anyway, why not rely on it for corner cases like denormals? You have to look at all steppings of all chips, though, to see what is hardware vs. software, if any. – old_timer Apr 24 '16 at 13:24
  • @dwelch: denormals are not "corner cases". It has been noted many times by signal-processing people that, because of denormals, processing "silence" can bring a system to its knees, because silence consists mainly of denormal inputs and CPUs choke on them. This is a real problem, not a theoretical one. – Michael Apr 25 '16 at 06:58 (a small illustration of this follows these comments)
  • Didn't say it wasn't. Of all the floating-point operations that happen per day, my guess is that denormals are in the noise, so you don't have to focus hardware on them; you can, if need be, pass them off to the CPU. Otherwise, as a practice, vendors would be focusing hardware on them too, and we wouldn't have this question from the OP. – old_timer Apr 25 '16 at 13:54
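To illustrate the "silence" scenario from the comment above (my own sketch, not from the commenters): a one-pole IIR filter fed zero input decays its state geometrically, so after a few thousand samples the state lands in the subnormal range, and every subsequent multiply pays the denormal penalty until the state finally underflows to zero.

```c
#include <float.h>
#include <stdio.h>

int main(void) {
    double state = 1.0;     /* leftover energy when the input goes silent */
    const double a = 0.95;  /* feedback coefficient of a one-pole filter  */
    for (long n = 0; n < 20000; n++) {
        state *= a;         /* filter running on all-zero ("silent") input */
        if (state > 0.0 && state < DBL_MIN) {  /* DBL_MIN: smallest normal */
            printf("state went subnormal at sample %ld\n", n);
            break;          /* every multiply from here on is denormal */
        }
    }
    return 0;
}
```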