Why does a number-crunching program start running much slower when it diverges into NaNs?

Question

A program repeats some calculation over an array of doubles. Then something unfortunate happens and NaNs get produced. After that, it starts running much slower.

-ffast-math does not change a thing.

Why does it happen with -ffast-math? Shouldn't it prevent throwing floating-point exceptions and just proceed and churn out NaNs at the same rate as usual numbers?

Simple example:

nan.c

#include <stdio.h>
#include <math.h>

int main() {
    long long int i;
    double a=-1,b=0,c=1;

    for(i=0; i<100000000; ++i) {
        a+=0.001*(b+c)/1000;
        b+=0.001*(a+c)/1000;
        c+=0.001*(a+b)/1000;
        if(i%1000000==0) { fprintf(stdout, "%g\n", a); fflush(stdout); }
        if(i==50000000) b=NAN;
    }
    return 0;
}

running:

$ gcc -ffast-math -O3 nan.c -o nan && ./nan  | ts '%.s'
...
1389025567.070093 2.00392e+33
1389025567.085662 1.48071e+34
1389025567.100250 1.0941e+35
1389025567.115273 8.08439e+35
1389025567.129992 5.9736e+36
1389025568.261108 nan
1389025569.385904 nan
1389025570.515169 nan
1389025571.657104 nan
1389025572.805347 nan

Update: I tried various combinations of -O3, -ffast-math, -msse and -msse3, with no effect. However, when I built for 64 bits instead of the usual 32 bits, it started to process NaNs as fast as other numbers (in addition to a general 50% speedup), even without any optimisation options. Why are NaNs so slow in 32-bit mode with -ffast-math?

What exactly is -ffast-math supposed to do (ideally copy in the GNU wiki description or similar, not a summary)? I'm reading this on a phone and find it hard to track down. – gnometorule, Jan 06 '14 at 16:26
`-ffast-math`: "Sets -fno-math-errno, -funsafe-math-optimizations, -fno-trapping-math, -ffinite-math-only, -fno-rounding-math, -fno-signaling-nans and -fcx-limited-range." – Vi., Jan 06 '14 at 16:28
What processor are you running on? And are you compiling with SSE? – Mysticial, Jan 06 '14 at 16:35
Running on an Intel Core i5, in a 32-bit system but on a 64-bit kernel. Adding `-msse3` does not change anything. However, using a 64-bit compiler instead of a 32-bit one makes NaNs approximately as fast as other numbers (no 100-fold slowdown). – Vi., Jan 06 '14 at 20:18
Some compilers use x87 for fp codegen by default on 32-bit even if SSE is enabled. Try adding `-mfpmath=sse` to your 32-bit C flags. – Stephen Canon, Jan 06 '14 at 22:53
Yes, with `-mfpmath=sse` it is as fast as 64-bit, including for NaNs. – Vi., Jan 06 '14 at 22:55

score 4 · Answer 1 · answered Jan 06 '14 at 16:31

Floating point operations on NaN are exceptional cases and definitely take longer to execute. It's important to remember when vectorizing with SSE because any NaNs that sneak into don't-care columns in the registers can still make your code run much slower.

This page includes some performance measurements of math on NaN which is even worse than I thought!

Ben Jackson

Is it just that the processor runs them slowly in general, or does it delegate processing of NaNs to something else (e.g. throwing an exception for the OS to emulate)? It's not just slower, it's 100 times slower... – Vi. Jan 06 '14 at 16:34
NaN and inf are handled at speed on SSE. They are only a performance hazard when floating-point is evaluated on the legacy x87 instructions (yet another reason to avoid x87!) – Stephen Canon Jan 06 '14 at 17:10
@StephenCanon I've definitely hit perf issues with NaN and SSE, as has the guy who tested at the link I added. – Ben Jackson Jan 06 '14 at 17:52
@BenJackson: the page you linked to says: "NaNs and infinities have long been handled at full speed on this floating-point unit [SSE]." He then goes on to discuss slowdowns associated with *denormals* on SSE. – Stephen Canon Jan 06 '14 at 19:27
Looks like it behaves much differently in 32-bit and 64-bit modes. – Vi. Jan 06 '14 at 20:23
@StephenCanon I guess I'm glomming "NaN" and "garbage" which could include denormals. – Ben Jackson Jan 06 '14 at 21:15
@Vi. I believe that the x87 FPU goes to microcode when it hits these special numbers. The microcode routines have not been optimized which slows them down even more. As soon as the FPU pipeline can't handle instructions all of the performance and parallelism gets lost. – Bruce Dawson Nov 11 '14 at 01:06

score 4 · Accepted Answer · answered Jan 06 '14 at 22:55

Your compiler defaults to using x87 (which incurs a stall for processing NaNs) when producing a 32-bit executable. Pass -mfpmath=sse to tell it to use SSE (which can handle NaNs at speed) instead.
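
To check this on a given build, a small harness along the following lines may help. It is only a sketch: the function names `fp_eval_description` and `time_recurrence` are made up for illustration, and the mapping from `FLT_EVAL_METHOD` to "x87" or "SSE" is the usual situation on x86 rather than a guarantee. The recurrence is the one from the question.

```c
#include <assert.h>
#include <float.h>
#include <time.h>

/* Describe how this build evaluates floating-point expressions.
   FLT_EVAL_METHOD is standard C99 (<float.h>):
     0 -> operations are evaluated in their own type (what SSE2 math usually gives),
     2 -> everything is evaluated as long double (what x87 math usually gives).
   The x87/SSE labels are an assumption about typical x86 compilers. */
const char *fp_eval_description(void) {
#if FLT_EVAL_METHOD == 0
    return "evaluated in own type (typical of SSE math)";
#elif FLT_EVAL_METHOD == 2
    return "evaluated as long double (typical of x87 math)";
#else
    return "other or indeterminate evaluation method";
#endif
}

/* Run the recurrence from the question for `iterations` steps,
   starting with b = b_start, and return the elapsed CPU time in seconds.
   The final value of a is stored in *a_out so the loop cannot be discarded. */
double time_recurrence(double b_start, long iterations, double *a_out) {
    assert(iterations > 0);
    double a = -1, b = b_start, c = 1;
    clock_t start = clock();
    for (long i = 0; i < iterations; ++i) {
        a += 0.001 * (b + c) / 1000;
        b += 0.001 * (a + c) / 1000;
        c += 0.001 * (a + b) / 1000;
    }
    clock_t end = clock();
    *a_out = a;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling `time_recurrence` once with `b_start = 0.0` and once with a NaN produced at run time, then comparing the two timings, should show whether the build suffers from the stall. Building the same file as a 32-bit executable with and without `-mfpmath=sse -msse2` is the comparison discussed in this thread.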

Stephen Canon

Why isn't x87 turned off by `-ffast-math`? – Vi. Jan 06 '14 at 22:56
`-ffast-math` simply licenses the compiler to reassociate some expressions, optimize as if NaN, inf, and signed zero do not exist, etc. It doesn't change anything about what instruction set is targeted. – Stephen Canon Jan 06 '14 at 22:57
Looks like speedy NaNs come from either building for 64-bit (unless `-mfpmath=387` is given explicitly) or using `-mfpmath=sse` together with `-msse2` (just `-msse` is not enough), independently of `-O3` or `-ffast-math`. – Vi. Jan 06 '14 at 23:04
Exactly right. You need `-msse2` because `SSE[1]` doesn't support double-precision. – Stephen Canon Jan 06 '14 at 23:09
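
One practical consequence of "optimize as if NaN does not exist": because `-ffast-math` implies `-ffinite-math-only`, the compiler is allowed to fold checks such as `x != x` or `isnan(x)` to false. If a program built with that flag still needs to detect the NaNs it produces, inspecting the bit pattern is one possible workaround. The sketch below assumes `double` is IEEE 754 binary64, and the helper name `is_nan_bits` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: double is a 64-bit IEEE 754 binary64 value. */
static_assert(sizeof(double) == sizeof(uint64_t), "double must be 64 bits wide");

/* Return 1 if x is a NaN, judged purely from its bits:
   exponent field all ones (0x7FF) and a non-zero mantissa.
   No floating-point comparison is involved, so the finite-math
   assumption has nothing to fold away. */
int is_nan_bits(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);               /* well-defined type punning */
    uint64_t exponent = (bits >> 52) & 0x7FF;     /* 11 exponent bits */
    uint64_t mantissa = bits & 0xFFFFFFFFFFFFFULL; /* low 52 bits */
    return exponent == 0x7FF && mantissa != 0;
}
```

In the question's loop, a call such as `is_nan_bits(a)` could replace a floating-point NaN test when deciding whether the computation has diverged.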
