What CPU state has an effect on Intel FPU and SSE performance?

Question

In trying to track down a performance issue, I ended up looking for information on what can have an effect on the performance of x87 and SSE instructions. I found that information incredibly difficult to track down as it tends to be hidden deep inside large Intel PDFs or sometimes mentioned on 3rd party websites without much explanation.

This question is about control words, bits, modes, specific data (e.g. denormals), whatever. It is not about memory bandwidth, cache, page tables, alignment or anything else memory related. I'll answer with a basic list of what I've found so far, but feel free to add more details or new state I'm not aware of.

What about the values being operated on, should that go in the "interesting for the purpose of this question"-list? It's not state, but it's not any of the other things either. – harold, Dec 22 '16 at 21:25
Unfortunately there is an awful lot of implicit internal execution state in modern out-of-order processors affecting performance, typically only documented in general terms. So you might, for instance, encounter a stall when an SSE register result from a floating-point operation is transferred to a different execution port for processing by an integer instruction. Anyway, I suspect you might have better luck solving your problem by giving the details of your specific performance issue instead. – doynax, Dec 22 '16 at 22:01
@doynax I know there's a lot going on in there but I'm mostly interested in obscure details which might have a large, unsuspected impact. Like, for example, if infinities or NaNs were several times slower than normal values for some operations. I will post another question with my specific problem when I've brought it down to an MCVE and add the explanation here... but it's starting to look like a CPU bug (same speed on one CPU, 6x difference on another). – Olivier, Dec 23 '16 at 00:45
@Olivier Last I checked, processing of denormals, infinities, and NaNs all had a significant negative impact on performance. Are you encountering any of those in your computation? I will try to find a reference. – njuffa, Dec 23 '16 at 03:47
Markus Wittmann, Thomas Zeiser, Georg Hager, and Gerhard Wellein, "Short Note on Costs of Floating Point Operations on current x86-64 Architectures: Denormals, Overflow, Underflow, and Division by Zero", ArXiv manuscript, June 2015 ([online](https://arxiv.org/pdf/1506.03997v1.pdf)) – njuffa, Dec 23 '16 at 04:30
Marc Andrysco, David Kohlbrenner, Keaton Mowery, Ranjit Jhala, Sorin Lerner, and Hovav Shacham, "On Subnormal Floating Point and Abnormal Timing", In *IEEE Symposium on Security and Privacy*, May 2015, pp. 623-639 [(online)](http://www.ieee-security.org/TC/SP2015/papers-archived/6949a623.pdf) – njuffa, Dec 23 '16 at 04:42
There is a possibly related [question](http://stackoverflow.com/questions/9314534/why-does-changing-0-1f-to-0-slow-down-performance-by-10x) regarding the performance impact of denormals. – njuffa, Dec 23 '16 at 04:48
@njuffa No, I don't have any, but those are great links. I've now posted a separate question with the actual problem I have. – Olivier, Dec 23 '16 at 15:13

score 3 · Answer 1 · answered Dec 22 '16 at 20:39

So far, I've found:

The FPU Control Word (FCW). This has a precision field which affects the speed of some operations. It is mostly obsolete, as it only affects x87 instructions as far as I can tell.
The MXCSR register. This affects SSE math through the DAZ (denormals are zero) and FTZ (flush to zero) bits. Calculations that involve denormals are slower; setting these bits avoids that cost at the expense of strict IEEE 754 behavior.
The state of the upper halves of the AVX (YMM) registers, which is cleared with the vzeroupper instruction. There is a very technical discussion about it on the Intel forums: Software consequences of extending XMM to YMM

Olivier

Precision in the FCW should only speed up division and sqrt, not mul/add/sub. Search for [`D3DCREATE_FPU_PRESERVE` in Bruce Dawson's Intermediate Floating-Point Precision article](https://randomascii.wordpress.com/2012/03/21/intermediate-floating-point-precision/). Agner Fog lists constant times for normalized floats with SSE mul/add/sub as well. – Peter Cordes Dec 28 '16 at 03:55
SSE/AVX are not slowed down by Inf or NaN. [x87 can be](http://stackoverflow.com/a/31879376/224132). – Peter Cordes Dec 28 '16 at 03:56