You questioned the answer saying it's to use higher precision during the summation, but I don't see why. That answer is correct. Consider this simplified version with completely made-up numbers:
#include <iostream>
#include <iomanip>
float w = 0.012345;
float calcFloat(const int* origin, int n )
{
float d = 0;
for( int k = 0; k < n; k++ )
d += origin[k] * w;
return (float)d;
}
float calcDouble(const int* origin, int n )
{
double d = 0;
for( int k = 0; k < n; k++ )
d += origin[k] * w;
return (float)d;
}
int main()
{
int o[] = { 1111, 22222, 33333, 444444, 5555 };
std::cout << std::setprecision(9) << calcFloat(o, 5) << '\n';
std::cout << std::setprecision(9) << calcDouble(o, 5) << '\n';
}
The results are:
6254.77979
6254.7793
So even though the inputs are the same in both cases, you get a different result using double
for the intermediate summation. Changing calcDouble
to use (double)w
doesn't change the output.
This suggests that the calculation of (origin[f[k].p0] + origin[f[k].p3] - origin[f[k].p1] - origin[f[k].p2])*f[k].w
is high-enough precision, but the accumulation of errors during the summation is what they're trying to avoid.
This is because of how errors are propagated when working with floating point numbers. Quoting The Floating-Point Guide: Error Propagation:
In general:
- Multiplication and division are “safe” operations
- Addition and subtraction are dangerous, because when numbers of different magnitudes are involved, digits of the smaller-magnitude number are lost.
So you want the higher-precision type for the sum, which involves addition. Multiplying the integer by a double
instead of a float
doesn't matter nearly as much: you will get something that is approximately as accurate as the float
value you start with (as long as the result it isn't very very large or very very small). But summing float
values that could have very different orders of magnitude, even when the individual numbers themselves are representable as float
, will accumulate errors and deviate further and further from the true answer.
To see that in action:
float f1 = 1e4, f2 = 1e-4;
std::cout << (f1 + f2) << '\n';
std::cout << (double(f1) + f2) << '\n';
Or equivalently, but closer to the original code:
float f1 = 1e4, f2 = 1e-4;
float f = f1;
f += f2;
double d = f1;
d += f2;
std::cout << f << '\n';
std::cout << d << '\n';
The result is:
10000
10000.0001
Adding the two floats loses precision. Adding the float to a double gives the right answer, even though the inputs were identical. You need nine significant digits to represent the correct value, and that's too many for a float
.