
I have a large amount of data to process with math-intensive operations on each data set. Much of it is analogous to image processing. However, since this data is read directly from a physical device, many of the pixel values can be invalid.

This makes NaN's property of representing not-a-number values and spreading through arithmetic operations very compelling. However, it also seems to require giving up some optimizations such as gcc's -ffast-math, and we also need to be cross-platform. Our current design uses a simple struct that contains a float value and a bool indicating validity.
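
For reference, our current design is roughly the following (names simplified for illustration):

struct MaybeFloat {
    float value;   // the measured value; meaningless when valid is false
    bool  valid;   // true only when the device produced usable data
};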

While it seems NaN was designed with this use in mind, others think it is more trouble than it is worth. Does anyone have advice, based on more intimate experience with IEEE 754, with performance in mind?

Andrew Hundt
  • Design your program so that this implementation detail is confined to one small component; then write one with NaNs and one with bools and compare. – Kerrek SB Mar 05 '12 at 21:25
  • I have experience from two projects. In one it worked fine, since we always used exactly one NaN. In another, things got out of hand, since people started encoding other stuff: "uninitialized", "not available", "out of range", etc. – PlasmaHH Mar 05 '12 at 21:25
  • @PlasmaHH: The latter is obviously wrong (that's why you have `+INF` and `-INF`). – MSalters Mar 06 '12 at 09:13
  • @MSalters: Obviously. But that doesn't stop some people from doing it. Whenever you don't have total control, you also have to think about what other people will make of your idea. – PlasmaHH Mar 06 '12 at 09:18
  • Do test on both Intel and AMD (and possibly CPUs as well, for the future). The handling of `NaN` is a special case and speeds are not consistent across implementations. – MSalters Mar 06 '12 at 09:18
  • @MSalters: Intel and AMD do make CPUs. – Lightness Races in Orbit Mar 08 '12 at 19:04
  • @LightnessRacesinOrbit: I don't remember if I meant **G**PU's or ARM CPU's, but indeed the current comment is a bit weird. – MSalters Mar 09 '12 at 08:44
  • For clarification, are you receiving a specific NaN from the hardware, or is the hardware sending 32 bits which may or may not represent a valid floating-point value? From your description, it sounds like the hardware may return invalid data without a well-defined value. – Alexis Wilke Mar 12 '12 at 07:57
  • @AlexisWilke We are supporting multiple devices that have various interesting ways of representing invalid data, usually "magic numbers" that of course differ for every device. I'm looking to take that data and reformat it in a way that will work with generic algorithm implementations. – Andrew Hundt Mar 14 '12 at 15:50
  • In that case, it seems to me that each driver should transform said magic numbers into a well-defined common NaN. This means scanning all the data... but if you know the specific magic numbers for hardware A, the specific magic numbers for hardware B, etc., then it becomes fairly simple? – Alexis Wilke Apr 09 '12 at 02:34
  • @AlexisWilke Yes, it is simple to remap the values, and we do just that. The essence of the question is asking about the best destination format to remap to. – Andrew Hundt Apr 10 '12 at 18:33

3 Answers


BRIEF: For strictest portability, don't use NaNs. Use a separate valid bit, e.g. a template like Valid<T>. However, if you know that you will only ever run on IEEE 754-2008 machines, and not IEEE 754-1985 (see below), then you may get away with it.

For performance, it is probably faster not to use NaNs on most of the machines that you have access to. However, I have been involved with hardware design of FP on several machines that are improving NaN handling performance, so there is a trend to make NaNs faster, and, in particular, signalling NaNs should soon be faster than Valid<T>.

DETAIL:

Not all floating point formats have NaNs. Not all systems use IEEE floating point. IBM hex floating point can still be found on some machines - actually systems, since IBM now supports IEEE FP on more recent machines.

Furthermore, IEEE floating point itself had compatibility issues wrt NaNs in IEEE 754-1985. E.g., see Wikipedia http://en.wikipedia.org/wiki/NaN:

The original IEEE 754 standard from 1985 (IEEE 754-1985) only described binary floating point formats, and did not specify how the signaled/quiet state was to be tagged. In practice, the most significant bit of the significand determined whether a NaN is signalling or quiet. Two different implementations, with reversed meanings, resulted.

  • most processors (including those of the Intel/AMD x86-32/x86-64 family, the Motorola 68000 family, the AIM PowerPC family, the ARM family, and the Sun SPARC family) set the signaled/quiet bit to non-zero if the NaN is quiet, and to zero if the NaN is signaling. Thus, on these processors, the bit represents an 'is_quiet' flag.
  • in NaNs generated by the PA-RISC and MIPS processors, the signaled/quiet bit is zero if the NaN is quiet, and non-zero if the NaN is signaling. Thus, on these processors, the bit represents an 'is_signaling' flag.

Thus, if your code may run on older HP machines, or current MIPS machines (which are ubiquitous in embedded systems), you should not depend on a fixed encoding of NaN, but should have a machine-dependent #ifdef for your special NaNs.
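
As an illustration only, such an #ifdef might look roughly like this; the bit patterns follow the quoted description above, but the macro names and the assumption of 32-bit IEEE floats are mine and need checking against your actual toolchains:

#include <cstdint>
#include <cstring>

// Sketch: pick a "special" quiet-NaN bit pattern per NaN convention.
// 0x7FC00000 is the usual quiet NaN where the top significand bit means
// "quiet" (x86, ARM, PowerPC, SPARC); 0x7FBFFFFF is a quiet NaN under the
// legacy MIPS/PA-RISC convention, where that bit means "signaling".
inline float special_nan() {
#if defined(__mips__) && !defined(__mips_nan2008)
    const std::uint32_t bits = 0x7FBFFFFFu;
#else
    const std::uint32_t bits = 0x7FC00000u;
#endif
    float f;
    std::memcpy(&f, &bits, sizeof f);   // copy the bit pattern into a float
    return f;
}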

IEEE 754-2008 standardizes NaN encodings, so this is getting better. It depends on your market.

As for performance: many machines essentially trap, or otherwise take a major hiccup in performance, when performing computations involving SNaNs (which must trap) and even QNaNs (which don't need to trap, i.e. which could be fast - and which are getting faster on some machines as we speak).

I can say with confidence that on older machines, particularly older Intel machines, you did NOT want to use NaNs if you cared about performance. E.g. http://www.cygnus-software.com/papers/x86andinfinity.html says "The Intel Pentium 4 handles infinities, NANs, and denormals very badly. ... If you write code that adds floating point numbers at the rate of one per clock cycle, and then throw infinities at it as input, the performance drops. A lot. A huge amount. ... NANs are even slower. Addition with NANs takes about 930 cycles. ... Denormals are a bit trickier to measure."

Get the picture? Almost 1000x slower to use a NaN than to do a normal floating point operation? In this case it is almost guaranteed that using a template like Valid<T> will be faster.

However, see the reference to "Pentium 4"? That's a really old web page. For years people like me have been saying "QNaNs should be faster", and it has slowly taken hold.

More recently (2009), Microsoft says http://connect.microsoft.com/VisualStudio/feedback/details/498934/big-performance-penalty-for-checking-for-nans-or-infinity "If you do math on arrays of double that contain large numbers of NaN's or Infinities, there is an order of magnitude performance penalty."

If I feel impelled, I may go and run a microbenchmark on some machines. But you should get the picture.

This should be changing because it is not that hard to make QNaNs fast. But it has always been a chicken and egg problem: hardware guys like those I work with say "Nobody uses NaNs, so we won't make them fast", while software guys don't use NaNs because they are slow. Still, the tide is slowly changing.

Heck, if you are using gcc and want best performance, you turn on optimizations like "-ffinite-math-only ... Allow optimizations for floating-point arithmetic that assume that arguments and results are not NaNs or +-Infs." Similar is true for most compilers.

By the way, you can google like I did, "NaN performance floating point" and check refs out yourself. And/or run your own microbenchmarks.

Finally, I have been assuming that you are using a template like

template<typename T> class Valid {
    bool valid;   // does 'value' currently hold real data?
    T value;
public:
    Valid() : valid(false), value() {}                      // invalid by default
    explicit Valid(const T& v) : valid(true), value(v) {}
    bool is_valid() const { return valid; }
    T get() const { return value; }                         // check is_valid() first
};
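
To get the NaN-like propagation you are after, such a template also needs arithmetic operators that combine validity. A rough sketch of one (the exact interface here is my assumption):

// If either operand is invalid, the result is invalid - the explicit-flag
// analogue of NaN propagation through arithmetic.
template<typename T>
Valid<T> operator+(const Valid<T>& a, const Valid<T>& b) {
    if (!a.is_valid() || !b.is_valid())
        return Valid<T>();                      // default-constructed == invalid
    return Valid<T>(a.get() + b.get());
}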

I like templates like this, because they can bring "validity tracking" not just to FP, but also to integer (Valid<int>), etc.

But, they can have a big cost. The operations are probably not much more expensive than NaN handling on old machines, but the data density can be really poor. sizeof(Valid<float>) may sometimes be 2*sizeof(float). This bad density may hurt performance much more than the operations involved.
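
If you want to see the density cost on your own ABI, a throwaway check like this makes it concrete (my addition; the exact numbers depend on the platform's alignment rules):

#include <cstdio>

int main() {
    // With the generic template above (no NaN specialization), the bool is
    // typically padded out to the alignment of float, doubling the footprint.
    std::printf("sizeof(float)        = %zu\n", sizeof(float));
    std::printf("sizeof(Valid<float>) = %zu\n", sizeof(Valid<float>));
    return 0;
}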

By the way, you should consider template specialization, so that Valid<float> uses NaNs if they are available and fast, and a valid bit otherwise.

// specialization: needs <cmath> (std::isnan) and <limits> (quiet_NaN)
template <> class Valid<float> {
    float value;   // a NaN bit pattern means "invalid"; no separate bool needed
public:
    Valid() : value(std::numeric_limits<float>::quiet_NaN()) {}
    explicit Valid(float v) : value(v) {}
    // NaN never compares equal to anything (even itself), so test with
    // std::isnan rather than ==/!= against a stored NaN constant.
    bool is_valid() const { return !std::isnan(value); }
    float get() const { return value; }
};

etc.

Anyway, you are better off having as few valid bits as possible, and packing them elsewhere, rather than a Valid<T> that keeps its bool right next to the value (a sketch of one such packed layout follows these examples). E.g.

struct Point { float x, y, z; };
Valid<Point> pt;

is better (density wise) than

struct Point_with_Valid_Coords { Valid<float> x, y, z; };

unless you are using NaNs - or some other special encoding.

And

struct Point_with_Valid_Coords { float x, y, z; bool valid_x, valid_y, valid_z; };

is in between - but then you have to do all the code yourself.
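
For bulk data, one way to do that packing is a parallel container of bits next to a plain array of values; a sketch (the names and container choices are mine):

#include <vector>

// Values in one array, validity packed separately (std::vector<bool> stores
// roughly one bit per entry) instead of one bool next to every float.
struct SampleBuffer {
    std::vector<float> values;
    std::vector<bool>  valid;    // valid[i] says whether values[i] is usable

    void push(float v, bool ok) {
        values.push_back(v);
        valid.push_back(ok);
    }
};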

BTW, I have been assuming you are using C++. If FORTRAN or Java ...

BOTTOM LINE: a separate valid bit is probably faster and more portable.

But NaN handling is speeding up, and one day soon will be good enough.

By the way, my preference: create a Valid template. Then you can use it for all data types. Specialize it for NaNs if it helps. Although my life's work is making things faster, IMHO it is usually more important to make the code clean.

Krazy Glew
  • I've had a tag specifying c++, but perhaps I should have also put it in the text. The most important platform I need to support is Intel Atom processors, then secondary is x86, x86-64, then everything else is a nice to have. I haven't been able to find much documentation about how atom performs with NaN. I may have to make a small benchmark. – Andrew Hundt Mar 16 '12 at 20:10
  • Do the benchmark and check. But I would be (agreeably) surprised if Atom handles NaNs quickly. I only left Intel 20 months ago. – Krazy Glew Mar 18 '12 at 01:30
  • Even on the same CPU there might be different hardware that handles those NaNs, with very different performance characteristics (FPU vs SSE, the latter is supposed to be at least as good and possibly much better). As of today, Intel i5 FPU is still bad. – Eugene Ryabtsev Mar 06 '14 at 10:42

If invalid data is very common, you are of course wasting a lot of time running it through the processing. If invalid data is common enough, it is probably better to use some kind of sparse data structure holding only the valid data. If it is not very common, you can instead keep a sparse data structure recording which data is invalid. That way you would not waste a bool for each value. But maybe memory is not a problem for you...
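
A sketch of what such a sparse layout could look like (the names and container choices are mine, just to illustrate the idea):

#include <cstddef>
#include <vector>

// Keep only the valid samples, together with their original positions, so
// invalid entries never enter the math-heavy loops at all.
struct SparseSamples {
    std::vector<std::size_t> index;   // original position of each valid sample
    std::vector<float>       value;   // the corresponding valid values

    void add(std::size_t i, float v) {
        index.push_back(i);
        value.push_back(v);
    }
};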

If you are doing operations such as multiplying two possibly invalid data entries, I understand it is compelling to use NaNs instead of checking both variables for validity and setting the corresponding flag in the result.

How portable do you need to be? Will you ever need to be able to port it to an architecture with only fixed point support? If that is the case, I think your choice is clear.

Personally I would only use NaNs if it proved to be much faster. Otherwise I'd say the code gets more clear if you have explicit handling of invalid data.

tombjo
  • I would say that right now in our project portable means linux, windows, osx, x86, x86-64, arm. We definitely have a floating point requirement that can't be avoided. – Andrew Hundt Mar 09 '12 at 19:53

Since the floating-point numbers come from a device, they probably have a limited range. You can use some other special number, rather than NaN, to indicate absence of data, e.g. 1e37. This solution is portable. I do not know whether or not it is more convenient for you than using a bool flag.
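
For example, something along these lines (the helper names are mine; 1e37 is the sentinel value suggested above):

// Sentinel sketch: one ordinary float value, chosen well outside the device's
// possible range, stands for "no data". No NaN or IEEE support required.
const float kNoData = 1e37f;

inline bool  is_valid(float v) { return v != kNoData; }  // exact compare of the same stored constant
inline float make_invalid()    { return kNoData; }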

Eugene
  • "*This solution is portable.*" Only if the chosen number is exactly representable on all underlying FP hardware, which makes this solution by definition _not_ portable. – ildjarn Mar 05 '12 at 21:49
  • It doesn't really matter. If 1E37 would be stored as 1E37+1, and then compared to 1E37 later, the second number would _also_ be rounded to 1E37+1, and the two rounded values would compare equal. I.e. the solution is portable across all deterministic FP representations, including but not limited to IEEE754. – MSalters Mar 06 '12 at 09:17
  • Indeed, as long as you don't generate it in some cases via `1E37/2 + 1E37/2` or something equally mental – Lightness Races in Orbit Mar 08 '12 at 19:05