
I just experimented with the SSE option "denormals are zeros" (DAZ) by setting it with _mm_setcsr( _mm_getcsr() | 0x40 ).

I found an interesting thing: this doesn't prevent SSE from generating denormals when both operands are non-denormal! It just makes SSE treat denormal operands as if they were zeros.

So I know what this option does. But what is it actually good for?


Addendum

I just read the Intel article linked by user nucleon, and I got curious about the performance impact of denormals on SSE computations.

So I wrote a little Windows program to test this:

#include <windows.h>   // DWORDLONG
#include <intrin.h>    // __rdtsc
#include <iostream>

using namespace std;

union DBL              // steps through consecutive double bit patterns
{
    DWORDLONG dwlValue;
    double    value;
};

int main()
{
    DWORDLONG dwlTicks;
    DBL       d;
    double    sum;

    // Bit patterns 1 .. 1e8-1 have a zero exponent field,
    // i.e. this loop sums denormal doubles.
    dwlTicks = __rdtsc();

    for( d.dwlValue = 0, sum = 0.0; d.dwlValue < 100000000; d.dwlValue++ )
        sum += d.value;

    dwlTicks = __rdtsc() - dwlTicks;
    cout << sum << endl;
    cout << dwlTicks / 100000000.0 << endl;

    // 0x0010000000000000 is the smallest normal double (2^-1022),
    // so this loop sums normal values only.
    dwlTicks = __rdtsc();

    for( d.dwlValue = 0x0010000000000000u, sum = 0.0;
         d.dwlValue < (0x0010000000000000u + 100000000); d.dwlValue++ )
        sum += d.value;

    dwlTicks = __rdtsc() - dwlTicks;
    cout << sum << endl;
    cout << dwlTicks / 100000000.0 << endl;

    return 0;
}

(I printed the sums only to prevent the compiler from optimizing away the summation.)

The result is that on my Xeon E3-1240 (Skylake), each iteration takes four clock cycles when "d" is a normal value, but about 150 clock cycles when "d" is a denormal! I would never have believed denormals could have such a huge performance impact if I hadn't seen it myself.

Bonita Montero

  • Referring to https://software.intel.com/en-us/node/513376 it seems that the more rigorous setting is `flush-to-zero (FTZ)`. My guess is that using a denormal as an input operand is more expensive than generating a denormal from a calculation with normal operands. So `denormals-are-zero (DAZ)` is a tradeoff, which is faster than non-FTZ, but slower than FTZ. On top of that: If you use FTZ, setting DAZ should never have an effect, except you access denormals calculated before setting FTZ. – nucleon Jun 17 '16 at 17:08
  • Thanks nucleon. The Intel article is very interesting! I've updated my post with a little test on denormals. – Bonita Montero Jun 17 '16 at 17:44
  • See also: [Bruce Dawson's article on denormals, why they matter, and the perf implications](https://randomascii.wordpress.com/2012/05/20/thats-not-normalthe-performance-of-odd-floats/). When they're slow on Intel CPUs, it's because the CPU traps to a microcode assist. AMD CPUs generally handle them in HW always, and don't have a speed penalty. – Peter Cordes Oct 14 '16 at 17:05

0 Answers