
Given a real value, can we check whether a float data type is enough to store the number, or whether a double is required?

I know precision varies from architecture to architecture. Is there any C/C++ function to determine the right data type?

jogojapan
Soham Chakraborty
  • Both can store from negative infinity to positive infinity. – Pubby Nov 29 '12 at 07:07
  • Yes, it is possible that float, or double, or both are insufficient! – Grijesh Chauhan Nov 29 '12 at 07:11
  • @Pubby: you must be joking... – Jakob S. Nov 29 '12 at 07:15
  • @Pubby: http://en.wikipedia.org/wiki/Computable_number – user541686 Nov 29 '12 at 07:17
  • What do you mean by "enough"? Do you mean it is within the range of the minimum and maximum float values? Or whether a float can represent it exactly? – David Brown Nov 29 '12 at 07:33
  • There is no such C++ function. It is your responsibility to determine the precision required and, based on that, to select float or double as the value's representation. – SChepurin Nov 29 '12 at 07:41
  • @JakobS. Pubby is correct. The *range* is from negative to positive infinity. – John Bartholomew Nov 29 '12 at 07:51
  • @JohnBartholomew: It is not. Although there are representations for negative and positive infinity themselves, there is of course a huge gap in the numbers; how could there not be, for any finite representation? The range of representable numbers is something like 10^-45..10^38 for `float`, e.g. – Jakob S. Nov 29 '12 at 07:55
  • @JakobS. The gap is irrelevant to the range, it simply means that between an infinity and the corresponding largest magnitude finite number there are no representable values, just as there are no representable values between two adjacent representable finite values. – John Bartholomew Nov 29 '12 at 07:56
  • @JohnBartholomew - Sometimes it is better to simply provide a link to a reliable source - http://en.wikipedia.org/wiki/Floating_point#Range_of_floating-point_numbers – SChepurin Nov 29 '12 at 07:59
  • @SChepurin: Of course floating point formats have a largest representable finite number (two, if you include both positive and negative), and of course there is a large gap between those numbers and the nearest infinity. But floating point formats *can* represent both positive and negative infinity, and therefore their range *does* extend from negative to positive infinity. What part of that do you disagree with? – John Bartholomew Nov 29 '12 at 08:06
  • @JohnBartholomew: Okay, agreed ;) – Jakob S. Nov 29 '12 at 08:11
  • @JohnBartholomew - Now, there is no problem. But still, questions like this one (asking for an explanation rather than a solution) are almost always better answered with a link to source information. – SChepurin Nov 29 '12 at 08:32
  • @SChepurin: Yep. I agree completely. – John Bartholomew Nov 29 '12 at 09:24

6 Answers


For background, see What Every Computer Scientist Should Know About Floating-Point Arithmetic

Unfortunately, I don't think there is any way to automate the decision.

Generally, when people represent numbers in floating point, rather than as strings, the intent is to do arithmetic using the numbers. Even if all the inputs fit in a given floating point type with acceptable precision, you still have to consider rounding error and intermediate results.

In practice, most calculations will run with enough precision for usable results using a 64-bit type. Many calculations will not get usable results using only 32 bits.

In modern processors, buses and arithmetic units are wide enough to give 32-bit and 64-bit floating point similar performance. The main motivation for using 32 bits is to save space when storing a very large array.

That leads to the following strategy:

If arrays are large enough to justify spending significant effort to halve their size, do analysis and experiments to decide whether a 32 bit type gives good enough results, and if so use it. Otherwise, use a 64 bit type.
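
As a rough illustration of why 32 bits often falls short, here is a minimal experiment (the values and iteration count are arbitrary choices of mine, not from the answer) that accumulates the same sum at both widths:

#include <stdio.h>

int main(void)
{
    float  fsum = 0.0f;
    double dsum = 0.0;

    /* Add 0.1 ten million times; the exact answer is 1,000,000. */
    for (long i = 0; i < 10000000L; ++i) {
        fsum += 0.1f;   /* rounding error accumulates quickly */
        dsum += 0.1;    /* rounding error stays small */
    }

    printf("float  sum: %f\n", fsum);  /* far from 1000000 */
    printf("double sum: %f\n", dsum);  /* very close to 1000000 */
    return 0;
}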

Patricia Shanahan
  • Vector computing (e.g. SSE) may get twice the throughput through the same ALU using single precision vs double, so 64-bit ALUs being commonplace isn't a good argument. Likewise you can fit twice as many 32-bit numbers through a data bus in the same amount of time, regardless of the width of the bus. The motivation for making things smaller is performance. Anyway, some kind of analysis of precision is usually warranted, since without that you can be blindsided by a precision bug in 64-bit just as in 32-bit. – Potatoswatter Dec 02 '12 at 09:24

I think your question presupposes a way to specify any "real number" to C / C++ (or any other program) without precision loss.

Suppose you get this real number by specifying it in code or through user input. One way to check whether a float or a double is enough to store it without precision loss is to count its significant bits and compare that count against the significand widths of float and double (see the sketch below).
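
Here is a minimal sketch of that check, assuming a finite double input; the function name is mine, and subnormal handling is glossed over:

#include <float.h>
#include <math.h>
#include <stdbool.h>

/* Count the significant bits of a finite double and check them
 * against float's significand width and exponent range. */
static bool fits_float_precision(double x)
{
    if (x == 0.0)
        return true;

    int exp;
    double m = frexp(x, &exp);   /* x = m * 2^exp with 0.5 <= |m| < 1 */

    /* Double the fraction until nothing fractional remains; the number
     * of doublings is the number of significant bits. */
    int bits = 0;
    while (m != floor(m)) {
        m *= 2.0;
        ++bits;
    }

    return bits <= FLT_MANT_DIG && exp >= FLT_MIN_EXP && exp <= FLT_MAX_EXP;
}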

If the number is given as an expression (e.g. 1/7 or sqrt(2)), you will also want ways of detecting rationals whose binary expansion repeats forever and irrational numbers, since no finite floating point format can store either exactly.

Moreover, there are numbers, such as 0.9, that float / double cannot in theory represent exactly (at least not in our binary computation paradigm) - see Jon Skeet's excellent answer on this.

Lastly, see additional discussion on float vs. double.

sampson-chen

Precision is not very platform-dependent. Although platforms are allowed to be different, float is almost universally IEEE standard single precision and double is double precision.

Single precision assigns 23 bits of "mantissa," or binary digits after the radix point (the binary point). Since the bit before the point is always one for normalized numbers, this equates to a 24-bit significand. Dividing 24 by log2(10) ≈ 3.32, a float gets you about 7.2 decimal digits of precision.

Following the same process for double yields 15.9 digits and long double yields 19.2 (for systems using the Intel 80-bit format).

The bits besides the mantissa are used for the exponent. The number of exponent bits determines the range of numbers allowed. Single goes to ~10^±38, double goes to ~10^±308.
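
These figures don't have to be derived by hand; as a sketch, the implementation's own <float.h> constants report them (note that FLT_DIG and DBL_DIG are the guaranteed decimal digits, rounded down to 6 and 15):

#include <float.h>
#include <stdio.h>

int main(void)
{
    /* Significand width, guaranteed decimal digits, largest finite value. */
    printf("float : %2d bits, %2d digits, max %e\n",
           FLT_MANT_DIG, FLT_DIG, FLT_MAX);
    printf("double: %2d bits, %2d digits, max %e\n",
           DBL_MANT_DIG, DBL_DIG, DBL_MAX);
    return 0;
}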

As for whether you need 7, 16, or 19 digits or if limited-precision representation is appropriate at all, that's really outside the scope of the question. It depends on the algorithm and the application.

Potatoswatter

A very detailed post that may or may not answer your question.

An entire series on floating point complexities!

jonathanasdf
  • Umm, I read the first dozen or so items in the series on floating point complexities, and they're at best oversimplified and at worst downright wrong. For example, "FLT_MIN is not the smallest positive float (FLT_MIN is the smallest positive normalized float)" is true **if** your hardware does subnormals. Most does, but not all. And that's why `std::numeric_limits` has a Boolean member named `has_denorm`. – Pete Becker Nov 29 '12 at 13:03
  • That particular article does state that it is talking about the IEEE 754 standard, in which subnormals ARE defined. If your hardware does not happen to be standards compliant, then you can hardly blame an article about the standard for being wrong about your hardware. The articles might be oversimplified, but for someone with no knowledge of the whole floating-point business, I feel they are at the right level of complexity. – jonathanasdf Nov 29 '12 at 16:16
  • I only looked at the first page, but I don't see where it says it's about IEEE 754. Regardless, C++ does not require IEEE 754. The problem most people have with floating-point arithmetic is that their view of it is oversimplified; yet another oversimplification doesn't help that. – Pete Becker Nov 29 '12 at 16:47
  • @PeteBecker For a large majority of programmers, assuming that their programming platform provides them with IEEE 754 floating-point arithmetics and understanding what this means (with some of the implications listed on http://www.altdevblogaday.com/2012/04/05/floating-point-complexities/ ) would be a huge improvement. – Pascal Cuoq Nov 30 '12 at 01:59
  • @PascalCuoq - sure, if it's **stated clearly** that what's being said applies to IEEE 754 implementations. My objection to the article in question is that it provides cute generalities without supplying that context. – Pete Becker Nov 30 '12 at 12:20

You cannot represent real numbers with float or double variables, only a subset of the rational numbers.

When you do floating point computation, your CPU's floating point unit will decide the best approximation for you.

I might be wrong, but I thought that the float (4 bytes) and double (8 bytes) floating point representations were actually specified independently of computer architectures.
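
As a quick sketch of that approximation (assuming IEEE 754 doubles), printing 0.1 with enough digits exposes the nearest representable value that is actually stored:

#include <stdio.h>

int main(void)
{
    /* 0.1 has no finite binary expansion, so the stored value is
     * the nearest representable double. */
    printf("%.17g\n", 0.1);   /* prints 0.10000000000000001 */
    return 0;
}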

Lundin

Couldn't you simply store it to a float and a double variable and then compare these two? The comparison implicitly converts the float back to a double - if there is no difference, the float is sufficient:

float f = (float)value;    // may round to single precision
double d = value;
if ((double)f == d)
{
    // the round trip is exact, so float is sufficient
}
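
For example, a self-contained sketch of the same check (the wrapper name is mine):

#include <stdbool.h>
#include <stdio.h>

// True when the double -> float -> double round trip is exact.
static bool float_is_sufficient(double value)
{
    float f = (float)value;
    return (double)f == value;
}

int main(void)
{
    printf("%d\n", float_is_sufficient(0.5));  /* 1: a power of two */
    printf("%d\n", float_is_sufficient(0.1));  /* 0: needs more bits */
    return 0;
}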
Jakob S.
  • Please do not suggest solutions like this one. Float and double are different in many aspects. – SChepurin Nov 29 '12 at 08:04
  • @Angew - I leave it to your research. But you can freely disagree with that. – SChepurin Nov 29 '12 at 08:35
  • If you cast double to float and then back to double, the result is almost(*) never equal to the original value, even if the original value can be represented as a float (up to its precision) – Victor K Nov 29 '12 at 08:56
  • @VictorK: What do you mean that, if the original value can be represented as float, converting to float and back to double almost never produces the original value? If the value in a double is exactly representable as a float, then both conversions produce the exact value; there is no change. – Eric Postpischil Nov 29 '12 at 12:49
  • @Eric Postpischil - Note that the question was about precision. Handling float and double representation for a value you most likely will have to take care about different formatting like std::setprecision. – SChepurin Nov 29 '12 at 14:10
  • @SChepurin: That statement does not appear to be related to my question. – Eric Postpischil Nov 29 '12 at 14:37
  • @Eric Postpischil - Agree :) This is a kinda twisted discussion. I just wanted to provide one of the reasons not to implement this solution. – SChepurin Nov 29 '12 at 14:41
  • @Eric Postpischil: This is exactly what I had in mind. In all other cases I would say: `float` is not sufficient, as the number is not exactly representable as `float` and therefore "something" is lost. Whether you do or do not care about that "something" has to be decided by the developer and not the machine. – Jakob S. Nov 30 '12 at 07:08
  • @Eric Postpischil `double` has 53-bit significand, `float` has 24-bit significand, when you convert double to float, you lose 29 bits, even if number is within min/max values for single-precision float (I didn't say if it can be represented _exactly_; I guess, it's my bad choice of words) – Victor K Nov 30 '12 at 13:36
  • @VictorK: The code in this answer is intended to detect whether a double is exactly representable as a float. Given that, the behavior you describe is not a criticism; it supports the purpose of the code: A double that cannot be exactly represented by a float is altered by the round-trip conversions, and a double that can be exactly represented by a float is not altered. That is the intent. – Eric Postpischil Nov 30 '12 at 14:18
  • OK, I agree that it precisely answers the question. It's the question that I find, err... questionable. What problem is the OP trying to solve? – Victor K Nov 30 '12 at 15:07