Why use double and then cast to float?

Question

I'm trying to improve the performance of surf.cpp. Starting at line 140, you can find this function:

inline float calcHaarPattern( const int* origin, const SurfHF* f, int n )
{
    double d = 0;
    for( int k = 0; k < n; k++ )
        d += (origin[f[k].p0] + origin[f[k].p3] - origin[f[k].p1] - origin[f[k].p2])*f[k].w;
    return (float)d;
}

An Intel Advisor vectorization analysis of this function reports "1 Data type conversions present", which could be inefficient (especially for vectorization).

But my question is: looking at this function, why would the authors have declared d as double and then cast it to float? If they wanted a decimal number, float would be fine. The only reason that comes to mind is that, since double is more precise than float, it can represent smaller numbers, while the final value is large enough to be stored in a float. However, I haven't run any tests on the value of d.

Any other possible reason?

@tobi303 ehm [nope](http://stackoverflow.com/questions/10108053/ranges-of-floating-point-datatype-in-c) — justHelloWorld, Feb 09 '17 at 19:38
@FrançoisAndrieux so what? :) You can sum two doubles and save the result in a float without any cast, right? — justHelloWorld, Feb 09 '17 at 19:39
You *would* have to perform a narrowing cast from `double` to `float`, even if it's implicit. Since your input is (presumably) `double` and your output is `float`, there has to be a cast *somewhere*. — François Andrieux, Feb 09 '17 at 19:42
Practically `double` will handle a wider range of numbers, but look at the third column. That is the REAL limiting feature. The point at which the numbers become damaged and possibly unusable due to lack of precision hits you much, much faster with a `float` than with a `double`. You are unlikely to get to those larger/smaller numbers before precision has smacked you silly. All of the numbers used have to be in a similar range or performing arithmetic on them will be meaningless. — user4581301, Feb 09 '17 at 19:42
It appears, from looking at the source code, that `f[k].w` is also `float`. I can only assume that `double` was used because the increased precision was relevant. — François Andrieux, Feb 09 '17 at 19:45
Reasons: 1. Ninjas will kill the author's family if `float` is not used. 2. Legacy API. Caller expects a `float` for reasons that have been lost to time. 3. Output of the function will be stored as a `float` because there isn't enough RAM to store `double`s. 4. Code's writer consumed far too much weed the night before and is tripping out, man. 5... Why bother? I can keep guessing for weeks. — user4581301, Feb 09 '17 at 19:51
If they wanted a decimal number they would not have used floating-point. Do you mean 'real number'? or something else that implies a fractional part? — user207421, Feb 15 '17 at 23:04

m. c. · Accepted Answer · 2017-02-09T19:57:09.540

8

Because the author wants higher precision during the calculation and rounds only the final result. This is the same as preserving more significant digits during a calculation.

More precisely, errors can accumulate during addition and subtraction. This error can be considerable when a large number of floating-point values is involved.

edited Feb 09 '17 at 19:57

answered Feb 09 '17 at 19:46

That's odd. Why aren't they casting `f[k].w` to `double` **before** multiplying with the integer. That way the code could be taking advantage of the higher precision of the sum, but decides not to with respect to the summands. That's really odd. – IInspectable Feb 09 '17 at 19:50
It seems only accumulation from 1 to n is promoted to double. Inside loop, those 4 numbers are kept in lower resolution... – m. c. Feb 09 '17 at 20:02
@IInspectable maybe because it doesn't make much difference? See the example in my answer. – Jonathan Wakely Feb 09 '17 at 20:04

Jonathan Wakely · Answer 2 · 2017-02-09T20:38:06.497

You questioned the answer saying it's to use higher precision during the summation, but I don't see why. That answer is correct. Consider this simplified version with completely made-up numbers:

#include <iostream>
#include <iomanip>

float w = 0.012345;

float calcFloat(const int* origin, int n )
{
    float d = 0;
    for( int k = 0; k < n; k++ )
        d += origin[k] * w;
    return (float)d;
}

float calcDouble(const int* origin, int n )
{
    double d = 0;
    for( int k = 0; k < n; k++ )
        d += origin[k] * w;
    return (float)d;
}


int main()
{
  int o[] = { 1111, 22222, 33333, 444444, 5555 };
  std::cout << std::setprecision(9) << calcFloat(o, 5) << '\n';
  std::cout << std::setprecision(9) << calcDouble(o, 5) << '\n';
}

The results are:

6254.77979
6254.7793

So even though the inputs are the same in both cases, you get a different result using double for the intermediate summation. Changing calcDouble to use (double)w doesn't change the output.

This suggests that the calculation of (origin[f[k].p0] + origin[f[k].p3] - origin[f[k].p1] - origin[f[k].p2])*f[k].w is high-enough precision, but the accumulation of errors during the summation is what they're trying to avoid.

This is because of how errors are propagated when working with floating point numbers. Quoting The Floating-Point Guide: Error Propagation:

In general:

Multiplication and division are “safe” operations

Addition and subtraction are dangerous, because when numbers of different magnitudes are involved, digits of the smaller-magnitude number are lost.

So you want the higher-precision type for the sum, which involves addition. Multiplying the integer by a double instead of a float doesn't matter nearly as much: you will get something that is approximately as accurate as the float value you start with (as long as the result isn't very, very large or very, very small). But summing float values that could have very different orders of magnitude, even when the individual numbers themselves are representable as float, will accumulate errors and deviate further and further from the true answer.

To see that in action:

float f1 = 1e4, f2 = 1e-4;
std::cout << (f1 + f2) << '\n';
std::cout << (double(f1) + f2) << '\n';

Or equivalently, but closer to the original code:

float f1 = 1e4, f2 = 1e-4;
float f = f1;
f += f2;
double d = f1;
d += f2;
std::cout << f << '\n';
std::cout << d << '\n';

The result is:

10000
10000.0001

Adding the two floats loses precision. Adding the float to a double gives the right answer, even though the inputs were identical. You need nine significant digits to represent the correct value, and that's too many for a float.

*"Changing `calcDouble` to use `(double)w` doesn't change the output."* - To be fair, it doesn't change the output, given the input **you picked at random**. This is nowhere near a proof, I'm sorry. — IInspectable, Feb 09 '17 at 20:08
I'm sorry, but you didn't ask for a proof and I didn't claim to be giving one. If you can't understand why using `double` for the sum matters you need to read up on floating point numbers and error propagation. I've added a reference for you to do that. — Jonathan Wakely, Feb 09 '17 at 20:17
*"Changing `calcDouble` to use `(double)w` doesn't change the output."* - That's an unconditional statement. If this is always true (as you imply), you'd need to offer something that's a bit stronger than *"based on my observations with a single set of very few samples"*. — IInspectable, Feb 10 '17 at 10:17
It's an unconditional statement about that piece of code, as is "The results are:". Do you think I'm suggesting you'll always get those results for any inputs? Don't be bloody silly and stop putting words in my mouth. Notice that the text you quoted says `(double)w` because I'm talking about my example program, not the OP's code that uses `f[k].w`, and it's a fact that changing _my example program_ to use `(double)w` doesn't change the output of _my example program_. I've also edited the answer to explain _why_ it behaves that way, so read up and stop being argumentative. — Jonathan Wakely, Feb 10 '17 at 11:27
