
I am doing some floating point arithmetic and having precision problems. The resulting value is different on two machines for the same input. I read the post Why can't I multiply a float? and other material on the web, and understood that it has to do with the binary representation of floating point numbers and with machine epsilon. However, I wanted to check whether there is a way to solve this problem, or some workaround for floating point arithmetic in C++. I am converting a float to an unsigned short for storage and converting it back when necessary. However, when I convert the unsigned short back to a float, the precision (to 6 decimal places) is preserved on one machine but not on the other.

//Convert FLOAT to SHORT
unsigned short sConst = 0xFFFF;
unsigned short shortValue = (unsigned short)(floatValue * sConst);

//Convert SHORT back to FLOAT
float floatValue = (float)shortValue / sConst;
Vidya Sagar
    What exactly is the problem? The mere fact that the multiplication result is slightly different on two different machines is not a problem by itself. Why does it constitute one for you? – Sven Marnach Oct 28 '10 at 14:50
  • Since there is no actual problem stated, this question cannot be answered properly. Voting to close as not a real question. – David Thornley Oct 28 '10 at 15:59
  • what is the precision on the other machine? 1 decimal place? 5? – Chance Oct 28 '10 at 16:54
  • Provide example results from the two machines for the same input and details of the processor used in each case, and also the compiler and compiler options if they were different. – Clifford Oct 28 '10 at 18:31

4 Answers


A short must be at least 16 bits, and in a whole lot of implementations that's exactly what it is. An unsigned 16-bit short will hold values from 0 to 65535. That means that a short will not hold a full five digits of precision, and certainly not six. If you want six digits, you need 20 bits.

Therefore, any loss of precision is likely due to the fact that you're trying to pack six digits of precision into something less than five digits. There is no solution to this, other than using an integral type that probably takes as much storage as a float.

I don't know why it would seem to work on one given system. Were you using the same numbers on both? Did one of them use an older floating-point format that happened to give the results you were expecting for the samples you tried? Was one of them using a larger short than the other?
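
As a rough illustration of the limit (a minimal sketch, not the asker's exact code; the value 0.123456f and the variable names are assumptions), the quantization step of a 16-bit encoding is 1/65535, roughly 1.5e-5, so only about four to five decimal digits survive the round trip:

#include <cstdio>

int main() {
    const unsigned short sConst = 0xFFFF;
    float original = 0.123456f;                        // six decimal digits
    unsigned short stored = (unsigned short)(original * sConst);
    float restored = (float)stored / sConst;
    std::printf("original  = %.6f\n", original);       // prints 0.123456
    std::printf("restored  = %.6f\n", restored);       // prints 0.123461 (last digit off)
    std::printf("step size = %.8f\n", 1.0f / sConst);  // ~0.00001526
    return 0;
}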

David Thornley

If you want to use native floating point types, the best you can do is to assert that the values output by your program do not differ too much from a set of reference values.

The precise definition of "too much" depends entirely on your application. For example, if you compute a + b on different platforms, you should find the two results to be within machine precision of each other. On the other hand, if you're doing something more complicated like matrix inversion, the results will most likely differ by more than machine precision. Determining precisely how close you can expect the results to be to each other is a very subtle and complicated process. Unless you know exactly what you are doing, it is probably safer (and saner) to determine the amount of precision you need downstream in your application and verify that the result is sufficiently precise.

To get an idea about how to compute the relative error between two floating point values robustly, see this answer and the floating point guide linked therein:

Floating point comparison functions for C#
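
As a concrete starting point, here is a minimal C++ sketch of such a comparison (the tolerance values and the function name nearlyEqual are illustrative assumptions, not taken from the linked answer; tune them to your application):

#include <algorithm>
#include <cmath>
#include <cstdio>

// True if a and b agree within an absolute or a relative tolerance.
bool nearlyEqual(double a, double b,
                 double relTol = 1e-9, double absTol = 1e-12) {
    double diff = std::fabs(a - b);
    if (diff <= absTol) return true;  // covers values very close to zero
    return diff <= relTol * std::max(std::fabs(a), std::fabs(b));
}

int main() {
    double x = 0.1 + 0.2;             // 0.30000000000000004
    std::printf("%s\n", nearlyEqual(x, 0.3) ? "close enough" : "different");
    return 0;
}

The absolute tolerance guards against the relative test failing when both values are near zero; the relative tolerance scales the allowed error with the magnitude of the operands.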

Philip Starhill

Are you looking for a standard like this:

Programming Languages C++ - Technical Report of Type 2 on Extensions for the programming language C++ to support decimal floating point arithmetic (draft)

Sheen

Instead of using 0xFFFF, use half of it, i.e. 32768, for the conversion. 32768 (0x8000) has the binary representation 1000000000000000, whereas 0xFFFF has the binary representation 1111111111111111. Because 0x8000 is a power of two, the multiplication and division performed during the conversion (to short, and back to float) only adjust the exponent and do not disturb the digits after the decimal point. For a one-way conversion, however, 0xFFFF is preferable, as it uses the full 16-bit range and gives a more accurate result.
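
A minimal sketch of the difference (0.5f is chosen only because it is exactly representable; roundTrip is a hypothetical helper, not code from the question):

#include <cstdio>

float roundTrip(float value, unsigned short scale) {
    unsigned short stored = (unsigned short)(value * scale);
    return (float)stored / scale;
}

int main() {
    float v = 0.5f;
    std::printf("scale 0x8000: %.8f\n", roundTrip(v, 0x8000)); // 0.50000000, exact
    std::printf("scale 0xFFFF: %.8f\n", roundTrip(v, 0xFFFF)); // 0.49999237, truncated
    return 0;
}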

Vidya Sagar