
Here: http://docs.oracle.com/javase/specs/jls/se8/html/jls-4.html#jls-4.2.3 it says that:

The finite nonzero values of any floating-point value set can all be expressed in the form s · m · 2^(e - N + 1), where s is +1 or -1, m is a positive integer less than 2^N, and e is an integer between Emin = -(2^(K-1)-2) and Emax = 2^(K-1)-1, inclusive, and where N and K are parameters that depend on the value set.

and there is a table below:

Parameter   float   
N              24   
K              8    

So let's say N = 24 and K = 8. Then the formula allows a value such as s · 2^N · 2^(2^(K-1)-1 - N + 1), which with the values from the table gives s * 2^24 * 2^(127 - 24), which is equal to s * 2^127. But a float has only 32 bits, so it's not possible to store such a big number in it.

So it's obvious that the initial formula should be read in a different way. How, then? Also, in the Javadoc for the Float max value: http://docs.oracle.com/javase/7/docs/api/java/lang/Float.html#MAX_VALUE

it says:

A constant holding the largest positive finite value of type float, (2-2^-23)·2^127

This also doesn't make sense, as the resulting value is much larger than 2^32, which is possibly the biggest value that can be stored in a float variable. So again, I'm misreading this notation. How should it be read?

dimo414
dhblah
  • Floating point in Java is basically [IEEE 754 standard.](http://754r.ucbtest.org/standards/754xml.html) – markspace Jan 09 '15 at 17:44
  • Try `System.out.println(Float.MAX_VALUE)` - does it not print `3.4028235E38` (which is [`(2-2^-23)·2^127`](http://www.wolframalpha.com/input/?i=%282-2^-23%29%C2%B72^127))? The formulas should be read exactly as written. Your error is in assuming that "*float has only 32 bits so it's not possible to store in it such a big number.*" – dimo414 Jan 09 '15 at 17:58
  • possible duplicate of [Precision of Floating Point](http://stackoverflow.com/questions/872544/precision-of-floating-point) – dimo414 Jan 09 '15 at 17:59
  • Your assumption that 2^32 is the biggest value that can be stored in 32 bits is wrong. It uses 8 bits for an _exponent_, which can represent a number between -128 and 127, 23 bits for the value being multiplied by 2 to that exponent, and 1 bit for the sign. – Louis Wasserman Jan 09 '15 at 18:02

1 Answer


The idea behind floating-point notation is to cover a much larger range of numbers than an integer representation can cover in the same number of bits. So, for example, you say that the "resulting value is much larger than 2^32". But that would only be a problem if the 32 bits were storing a plain binary integer, the way one writes a number out in base 2 in a typical math class.
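
To make the contrast concrete, here is a small sketch comparing the two 32-bit types. The class name `FloatVsIntRange` and its helper method are made up for this illustration; the constants are the standard ones from `java.lang.Integer` and `java.lang.Float`.

```java
public class FloatVsIntRange {
    // Hypothetical helper: true if float's largest finite value exceeds int's.
    static boolean floatRangeExceedsInt() {
        return Float.MAX_VALUE > Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        // Both types occupy the same number of bits...
        System.out.println(Integer.SIZE);      // 32
        System.out.println(Float.SIZE);        // 32
        // ...but the bits are interpreted differently, so the ranges differ.
        System.out.println(Integer.MAX_VALUE); // 2147483647
        System.out.println(Float.MAX_VALUE);   // 3.4028235E38
        System.out.println(floatRangeExceedsInt()); // true
    }
}
```

Same width, very different maximum: the extra range comes from how the bits are interpreted, not from extra storage.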

Instead, floating-point representations split those 32 bits into two main parts:

- significand
- exponent

For simplicity, imagine that 3 bytes are used for the significand and 1 byte for the exponent. Also assume that each of these is a plain binary integer. Then the three bytes can hold values up to 2^24 - 1, or up to 2^23 - 1 if you want to keep one bit for the sign. The other byte can hold values up to 2^7 - 1 = 127 (if you want a sign there too).

So, you could express 500 · 2^100 by storing the 500 in the three bytes and the 100 in the one byte.
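
The real `float` layout differs a little from the simplified 3-byte/1-byte split: as noted in the comments, it uses 1 sign bit, 8 exponent bits and 23 stored significand bits, and the exponent is stored with a bias of 127. The sketch below pulls those fields out with `Float.floatToIntBits` and plugs them back into the JLS formula s · m · 2^(e - N + 1) with N = 24. The class and method names are hypothetical, and the code assumes a normalized value (not zero, subnormal, infinity or NaN).

```java
public class FloatFields {
    // Sign bit: 0 for positive, 1 for negative.
    static int signBit(float f) {
        return Float.floatToIntBits(f) >>> 31;
    }

    // Unbiased exponent e (stored 8-bit field minus the bias of 127).
    static int exponent(float f) {
        return ((Float.floatToIntBits(f) >>> 23) & 0xFF) - 127;
    }

    // Integer significand m: the 23 stored bits plus the implicit leading 1.
    // Only meaningful for normalized values (assumption of this sketch).
    static int significand(float f) {
        return (Float.floatToIntBits(f) & 0x7FFFFF) | 0x800000;
    }

    // Rebuild s * m * 2^(e - N + 1) with N = 24, computed exactly in double.
    static double rebuild(float f) {
        int s = signBit(f) == 0 ? 1 : -1;
        return Math.scalb((double) (s * significand(f)), exponent(f) - 23);
    }

    public static void main(String[] args) {
        float max = Float.MAX_VALUE;
        System.out.println(exponent(max));       // 127
        System.out.println(significand(max));    // 16777215, i.e. 2^24 - 1
        System.out.println(rebuild(max) == max); // true
    }
}
```

For `Float.MAX_VALUE` this gives m = 2^24 - 1 and e = 127, so the value is (2^24 - 1) · 2^104, which is the same number as the Javadoc's (2 - 2^-23) · 2^127.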

Essentially, one cannot store every number precisely. One converts the number into significand-and-exponent form and keeps only as many significant digits as fit in the portion reserved for the significand (3 bytes in this example).
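
The price of the larger range can be seen with a short experiment. This is only a sketch: the class `FloatPrecision` and its methods are invented for the illustration, and it relies on `float` having a 24-bit significand, as in the JLS table quoted in the question.

```java
public class FloatPrecision {
    // True if the int value n is represented exactly after conversion to float.
    // The comparison is done in double, which holds every int exactly.
    static boolean isExactAsFloat(int n) {
        return (double) (float) n == (double) n;
    }

    // Distance between Float.MAX_VALUE and the next smaller float.
    static float gapAtMax() {
        return Math.ulp(Float.MAX_VALUE);
    }

    public static void main(String[] args) {
        System.out.println(isExactAsFloat(16_777_216)); // true:  2^24 fits
        System.out.println(isExactAsFloat(16_777_217)); // false: 2^24 + 1 needs 25 bits
        // Near the top of the range, neighbouring floats are 2^104 apart.
        System.out.println(gapAtMax() == Math.scalb(1.0f, 104)); // true
    }
}
```

So a `float` can reach about 3.4 · 10^38, but it cannot tell 16777216 and 16777217 apart.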

Rather than try to explain the complications, check this Wikipedia article for more.

Darius X.