
Working in high-performance computing, we tend to default to single-precision floating-point numbers (float or real) whenever possible. This is because each operation is individually faster, so you can perform more operations per second.

One of the more senior people I work with, however, always insists that (when accuracy is required) you should temporarily convert your single-precision data to double-precision in order to perform division. That is:

float a, b;
float ans = ((double)a)/((double)b);

or

real :: a, b, ans
ans = real(dble(a)/dble(b))

depending on the language you're working in. In my opinion, this looks really ugly, and to be honest I don't even know whether the answer held in ans will be more accurate than if you had simply written ans = a/b in single precision.

Can someone tell me whether converting your numbers to double precision before the arithmetic, specifically for division, will actually result in a more accurate answer? Is this language/compiler specific, or is it governed by the IEEE standard? For what values would this accuracy improvement be most noticeable?

Any enlightening comments/answers would be much appreciated.

NoseKnowsAll
  • `((double)a)/((double)b)` is equivalent to `(double)a/b`. – Jashaszun Jul 10 '15 at 23:11
  • Only a single cast is needed, because all other operands in the expression will be automatically promoted to the appropriate type. Moreover, on some architectures floats are not faster than doubles, and they have only ~7 digits of precision, which may not suit a lot of purposes. – phuclv Jul 10 '15 at 23:31
  • 3
    In fact `float` arithmetic can be slower than `double` arithmetic. You really need to benchmark. – M.M Jul 11 '15 at 01:40
  • You've tagged this fortran - have you done any testing/analysis using fortran code? – Ross Jul 11 '15 at 04:19
  • 1
    @MattMcNabb In practice, you will save a lot of cache and memory bandwidth with singles, and that is more important than the instruction duration. Also, the SIMD intrinsics for singles are available and can do more operations at a time than those for doubles. – Vladimir F Героям слава Jul 11 '15 at 08:09
  • @MattMcNabb What Vladimir said is true and what I'm going for. Compilers using SIMD instructions are going to be able to work with twice as many `float`s as `double`s, essentially doubling your performance. Yes, we have tested this. – NoseKnowsAll Jul 11 '15 at 18:03

4 Answers

10

float ans = ((double)a)/((double)b);

This article demonstrates that ans is always the same as what a single-precision division would compute, for IEEE 754 arithmetic and FLT_EVAL_METHOD=0.

When FLT_EVAL_METHOD=1, the same property is also trivially true.

When FLT_EVAL_METHOD=2, I am not sure. It is possible that one might interpret the rules as meaning that the long double computation of a/b must first be rounded to double, then to float. In this case, it can be less accurate than directly rounding from long double to float (the latter produces the correctly rounded results, whereas the former could fail to do so in extremely rare cases, unless another theorem like Figueroa's applies and shows that this never happens).

Long story short, for any modern, reasonable floating-point computing platform (*), it is superstition that float ans = ((double)a)/((double)b); has any benefits. You should ask the senior people you refer to in your question to exhibit one pair a, b of values for which the result is different, let alone more accurate. Surely, if they insist that this is better, it should be no trouble for them to provide a single pair of values for which it makes a difference.

(*) remember to use -fexcess-precision=standard with GCC to preserve your sanity
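
For anyone who wants to put that challenge to the test, here is a minimal sketch (assuming an IEEE 754 platform with FLT_EVAL_METHOD == 0 and no fast-math options; the random positive inputs are arbitrary) that searches for a pair a, b on which the two expressions disagree:

#include <float.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    printf("FLT_EVAL_METHOD = %d\n", FLT_EVAL_METHOD);
    srand(12345);
    for (int i = 0; i < 10000000; i++) {
        /* arbitrary positive single-precision operands */
        float a = ((float)rand() + 1.0f) / ((float)rand() + 1.0f);
        float b = ((float)rand() + 1.0f) / ((float)rand() + 1.0f);

        float direct     = a / b;                          /* plain single-precision division */
        float via_double = (float)((double)a / (double)b); /* divide in double, round back    */

        if (direct != via_double) {
            printf("counterexample: a=%a b=%a\n", a, b);
            return 1;
        }
    }
    printf("no counterexample found\n");
    return 0;
}

On such a platform both expressions yield the correctly rounded single-precision quotient, so the search should come up empty.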

Pascal Cuoq
  • 1
    +1 for pointing to Figueroa's little-known paper. See also this [recent paper by Roux](https://hal.archives-ouvertes.fr/hal-01091186/document) – njuffa Jul 11 '15 at 06:54
  • Your answer appears to contradict the numerical tests provided in the other (older) answers. How do you explain that? – Vladimir F Героям слава Jul 11 '15 at 08:50
  • 2
    Got it, the others are in error when they print the double result directly or use more operations than in the question. We are, however, interested in the final result converted to a single. +1 – Vladimir F Героям слава Jul 11 '15 at 09:08
  • Thanks for the discussion. I didn't think to look for a compiler setting that would control this behavior. I don't need to butt heads with him; I was just hoping to further my own knowledge on the matter. – NoseKnowsAll Jul 11 '15 at 18:11
4

This depends greatly on what platform is being used.

An 80x86 (or a 1980s-era 8087) using non-SSE instructions performs all its arithmetic in 80-bit precision (long double or real*10). It is the "store" instruction, which moves results from the numeric coprocessor to memory, that loses precision.

Unless it is a really bone-headed compiler, maximum precision should occur from

float a = something, b = something_else;
float ans = a/b;

since, to perform the division, the single-precision operands are extended to 80-bit precision when loaded, and the result is also held in extended precision.

If you are doing something more intricate and want to maintain maximum precision, don't store intermediate results in smaller-sized variables:

float a, b, c, d;

float prod_ad = a * d;
float prod_bc = b * c;
float sum_both = prod_ad + prod_bc;   // less accurate

That gives a less precise result than doing it all at once, since most compilers will produce code that keeps all the intermediate values at extended precision:

float a, b, c, d;

float sum_both = a * d + b * c;   // more accurate
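
Here is a self-contained sketch of that comparison, with arbitrarily chosen values (on an SSE build with FLT_EVAL_METHOD == 0 the two results will typically be identical, barring FMA contraction; a difference can appear under x87-style excess precision):

#include <float.h>
#include <stdio.h>

int main(void)
{
    /* arbitrary example values */
    float a = 1.0f / 3.0f, b = 2.0f / 7.0f, c = 3.0f / 11.0f, d = 5.0f / 13.0f;

    /* intermediates stored in float variables, rounding each one to single precision */
    float prod_ad  = a * d;
    float prod_bc  = b * c;
    float stepwise = prod_ad + prod_bc;

    /* whole expression at once; with FLT_EVAL_METHOD == 2 the intermediates
       can stay in extended precision until the final store */
    float at_once = a * d + b * c;

    /* double-precision reference for comparison */
    double ref = (double)a * d + (double)b * c;

    printf("FLT_EVAL_METHOD = %d\n", FLT_EVAL_METHOD);
    printf("stepwise: %.10g  (error %g)\n", stepwise, stepwise - ref);
    printf("at once:  %.10g  (error %g)\n", at_once,  at_once  - ref);
    return 0;
}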

Building on Eugeniu Rosca's example program:

#include "stdio.h"
void main(void)
{
    float a=73;
    float b=19;

    long double a1 = a;
    long double b1 = b;

    float ans1 = (a*a*a/b/b/b);                // computed entirely in single precision
    float ans2 = ((double)a*(double)a*(double)a/(double)b/(double)b/(double)b);   // via double, rounded back to float
    float ans3 = a1*a1*a1/b1/b1/b1;            // via long double, rounded back to float
    long double ans4 = a1*a1*a1/b1/b1/b1;      // kept in long double

    printf ("plain:  %.20g\n", ans1);
    printf ("cast:   %.20g\n", ans2);
    printf ("native: %.20g\n", ans3);
    printf ("full:   %.20Lg\n", ans4);
    return 0;
}

produces the following, regardless of optimization level:

plain:  56.716281890869140625
cast:   56.71628570556640625
native: 56.71628570556640625
full:   56.716285172765709289

This shows that for trivial operations there isn't much difference. However, changing the constants to something more of a precision challenge:

float a=0.333333333333333333333333;
float b=0.1;

provides

plain:  37.03704071044921875
cast:   37.037036895751953125
native: 37.037036895751953125
full:   37.037038692721614131

where the precision difference has a more pronounced effect.

wallyk
  • `sum_both` will be the same in the above 2 cases if `FLT_EVAL_METHOD == 0`. Most modern systems don't use extended internal precision like the x87's because it makes results hard to predict. – phuclv Jul 10 '15 at 23:36
3

Yes, converting to double precision will give you better accuracy (or, shall I say, precision) in division. One could say that this is up to IEEE, but only because IEEE defines the formats and standards. Doubles are inherently more precise than floats, both in how they store numbers and in how they divide.

To answer your last question, this would be most noticeable with a large a and a small (less than 1) b, because then you end up with a very large quotient, in a range where floating-point numbers are more coarsely spaced.
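
As a rough illustration of that granularity point, here is a sketch with hypothetical values (a large a and a b below 1); the double quotient carries more significant digits than the float quotient, and the last line shows what it becomes when rounded back to single precision:

#include <stdio.h>

int main(void)
{
    /* hypothetical values: large numerator, small (< 1) denominator */
    float a = 3.0e20f;
    float b = 7.0e-3f;

    float  qf = a / b;                 /* single-precision quotient */
    double qd = (double)a / (double)b; /* double-precision quotient */

    printf("float quotient:        %.17g\n", qf);
    printf("double quotient:       %.17g\n", qd);
    printf("rounded back to float: %.17g\n", (float)qd);
    return 0;
}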

Jashaszun
  • 2
    It would be helpful if you showed an example, like @EugeniuRosca did, where it produces different results. – Barmar Jul 10 '15 at 23:22
1

Running this on x86 (GCC 4.9.3):

#include "stdio.h"
int main(int arc, char **argv)
{
    float a=73;
    float b=19;

    float ans1 = (a*a*a/b/b/b);
    float ans2 = ((double)a*(double)a*(double)a/(double)b/(double)b/(double)b);
    printf("plain: %f\n", ans1);
    printf("cast:  %f\n", ans2);
    return 0;
}

outputs:

plain: 56.716282
cast:  56.716286

The same operations in a Windows calculator return:

56.716285172765709287068085726782

Clearly, the second result has greater accuracy.

Eugeniu Rosca
  • 1
    The generated code (by gcc) simply prints precompiled constants. A most odd result (which matches your output). The actual value, via a calculator, is `56.716285173` – wallyk Jul 10 '15 at 23:36
  • 1
    [`void main()`](http://stackoverflow.com/q/204476/995714) is incorrect. It's just an extension of some compilers like Turbo C or MSVC. It's not a standard feature – phuclv Jul 10 '15 at 23:40
  • 2
    @Lưu Vĩnh Phúc: I don't think avoiding compiler warnings is the subject of this question. Still, I will correct that for you. – Eugeniu Rosca Jul 10 '15 at 23:41