Math Error with Primitive Operators

Question

I am having an issue with primitive types and the built-in operators. All of the operators work for all data types except for float and (un)signed long long int.

Why is the result wrong even when multiplying by one? Also, why do +10 and -10 give the same number as +1, -1, /1, and *1?

The number 461168601 was chosen because it fits within the max float and max signed long long int.

Ran the following code and got the following output:

fmax  : 340282346638528859811704183484516925440
imax  : 9223372036854775807
i     : 461168601
f     : 10
f2    : 1

461168601 / 10 = 46116860
461168601 + 10 = 461168608
461168601 - 10 = 461168608

461168601 * 1 = 461168608
461168601 / 1 = 461168608
461168601 + 1 = 461168608
461168601 - 1 = 461168608

The following code can be run here.

#include <iostream>
#include <sstream>
#include <iomanip>
#include <limits>

#define fmax std::numeric_limits<float>::max()
#define imax std::numeric_limits<signed long long int>::max()

int main()
{

    signed long long int i    = 461168601;
    float f = 10;
    float f2 = 1;
    std::cout << std::setprecision(40);
    std::cout <<"fmax  : " << fmax  << std::endl;
    std::cout <<"imax  : " << imax  << std::endl;
    std::cout <<"i     : " << i    << std::endl;
    std::cout <<"f     : " << f    << std::endl;
    std::cout <<"f2    : " << f2   << std::endl;
    std::cout <<std::endl;
    std::cout << i << " / " << f << " = " << i / f << std::endl;
    std::cout << i << " + " << f << " = " << i + f << std::endl;
    std::cout << i << " - " << f << " = " << i - f << std::endl;
    std::cout <<std::endl;
    std::cout << i << " * " << f2 << " = " <<i * f2 << std::endl;
    std::cout << i << " / " << f2 << " = " << i / f2 << std::endl;
    std::cout << i << " + " << f2 << " = " << i + f2 << std::endl;
    std::cout << i << " - " << f2 << " = " << i - f2 << std::endl;
}

Possible duplicate of [Implicit type conversion rules in C++ operators](http://stackoverflow.com/questions/5563000/implicit-type-conversion-rules-in-c-operators) – Mohamad Elghawi, Nov 25 '15 at 16:11
My question does not pertain to the typecasting, but to the incorrect results. – tkellehe, Nov 25 '15 at 16:15
Why are you calling `2**63-1` FLT_MAX?? That is far less than the max magnitude of a float and far more than the max precision of a float. Your "wrong" answer consists of casting `2**62-1` to a `float`, which results in `2**62`. There is no float representation of `2**62-1`, but there is a float representation for `2**62`. – JSF, Nov 25 '15 at 16:15
@TMKelleher You asked why an unsigned long long int would be cast to a float. – Mohamad Elghawi, Nov 25 '15 at 16:19
You made two mistakes. The first one involves implicit type conversion (`4611686018427387903` is first converted to a `float` before any operation). The second involves the way values are stored in a float: unlike integer types, if a float can store the exact value of `X`, it does not mean it can exactly store every integer value between `0` and `X`. It is not possible to store `4611686018427387903` exactly in a `float`, so you get the closest value, which is `4611686018427387904`. That is why your `* 1` operation is wrong (same reasoning for the other operations). – Holt, Nov 25 '15 at 16:19

Alex · Answer 1 · 2015-11-25T16:35:19.103

0

The error is caused by the large difference in magnitude between 4611686018427387904 and 1 or 10. Avoid adding numbers whose magnitudes differ this much, because the gap between two adjacent floating-point numbers grows with the exponent.

When two floating-point numbers are added, they are first aligned to the same exponent (the larger one). For example, before the operation you have 1e10 and 1e-10; after alignment you have 1e10 and 0e10, so the result is 1e10.

edited Nov 25 '15 at 16:35

answered Nov 25 '15 at 16:27


If I am understanding you the first number should become something like `4.611686018427387904E18` and `10` becomes `0.000000000000000001E18` which may end up being `0` if the number is small enough. So that would mean that `4611686018427387904 + 1` should equal `4611686018427387904` due to the loss of the `1` due to precision. Alas that is still not the case. – tkellehe Nov 25 '15 at 16:45
@TMKelleher Even casting to float can cause the issue. Just try to output: `(float) (461168601)` – Alex Nov 25 '15 at 16:55
That makes sense, but why would I get the same problems when I use the number `230584000` where casting to a `float` returns the same number? `230584000 + 10 = 230584016` `230584000 - 10 = 230583984` `230584000 * 1 = 230584000` `230584000 / 1 = 230584000` `230584000 + 1 = 230584000` `230584000 - 1 = 230584000` – tkellehe Nov 25 '15 at 17:13

score 0 · Accepted Answer · answered Nov 25 '15 at 18:05

Dug around some and found this article.

Casting opens up its own can of worms. You have to be careful, because your float might not have enough precision to preserve an entire integer. A 32-bit integer can represent any 9-digit decimal number, but a 32-bit float only offers about 7 digits of precision. So if you have large integers, making this conversion will clobber them. Thankfully, doubles have enough precision to preserve a whole 32-bit integer (notice, again, the analogy between floating point precision and integer dynamic range). Also, there is some overhead associated with converting between numeric types, going from float to int or between float and double.

So, essentially, once the integer part of a number grows beyond about seven decimal digits, a float can no longer hold every digit: it keeps roughly the seven most significant digits and rounds the rest away. (An IEEE 754 float has a 24-bit significand, so every integer up to 2^24 = 16777216 is exact, but not every integer beyond that.) Past that point, results start to show floating-point rounding error.
