I'm not sure if what I've done is the best way of going about the problem:

0010 0010 0001 1110 1100 1110 0000 0000

I split it up:

Sign : 0 (positive)

Exponent: 0100 0100 (in base 2) -> 2^2 + 2^6 = 68 -> excess 127: 68 - 127 = -59 (base 10)

Mantissa: (1).001 1110 1100 1110 0000 0000 -> decimal digits needed: d-10 = d-2 * log2 / log10 = 24 * log2 / log10 = 7.22 ~ 8 (teacher told us to always round up)

So the mantissa in base 10 is: 2^0 + 2^-3 + 2^-4 + 2^-5 + 2^-6 + 2^-8 + 2^-9 + 2^-12 + 2^-13 + 2^-14 = 1.2406616 (base 10)

Therefore the real number is:

+1.2406616 * 2^(-59) = 2.1522048 * 10^-18
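
As a cross-check on the hand calculation, here is a minimal Python sketch that splits the same 32-bit pattern into its fields and rebuilds the value. The helper name `decode_single` is made up for illustration, and it assumes a normalized number (exponent field neither all zeros nor all ones).

```python
import struct

def decode_single(bits):
    """Split a 32-bit IEEE-754 single pattern into fields (normalized numbers only)."""
    sign = bits >> 31                    # 1 sign bit
    biased = (bits >> 23) & 0xFF         # 8 exponent bits, excess 127
    fraction = bits & 0x7FFFFF           # 23 stored mantissa bits
    significand = 1 + fraction / 2**23   # restore the hidden leading 1
    exponent = biased - 127
    value = (-1) ** sign * significand * 2.0 ** exponent
    return sign, exponent, significand, value

# 0010 0010 0001 1110 1100 1110 0000 0000
bits = 0b0010_0010_0001_1110_1100_1110_0000_0000
sign, exponent, significand, value = decode_single(bits)
print(sign, exponent, significand)   # 0 -59 1.24066162109375
print("%.7e" % value)                # 2.1522048e-18

# Independent check: let struct reinterpret the same four bytes as a float.
assert struct.unpack(">f", bits.to_bytes(4, "big"))[0] == value
```

The fields and the 8-digit result agree with the values worked out by hand above.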

But is the 10^x representation good? How do I find the right number of sig figs? Would it be the same as the rule used above?

user3251142

1 Answer

The representation is almost good. I'd say you need a total of 9 significant digits (you have 8).

See "Printf width specifier to maintain precision of floating-point value".

The right number of significant digits depends on what "right" means.

If you want to print a number to x significant decimal digits, read it back, and be sure you get the same number again, then for all IEEE-754 singles a total of 9 significant digits is needed: 1 before and 8 after the '.' in scientific notation. You may get by with fewer digits for some numbers, but some numbers need as many as 9.

In C this is defined as FLT_DECIMAL_DIG.

Printing more than 9 does not hurt; the text just does not convert back to a different IEEE-754 single-precision number than it would have had only 9 been used.
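
A small Python sketch of that round trip, for illustration only. Python floats are doubles, so the hypothetical helper `to_single` uses `struct` to round to the nearest single; `round_trips` is likewise a made-up name.

```python
import struct

def to_single(x):
    """Round a Python float (a double) to the nearest IEEE-754 single."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def round_trips(x, digits):
    """Print x with `digits` significant digits, read it back as a single, compare."""
    text = "%.*e" % (digits - 1, x)
    return to_single(float(text)) == x

x = float.fromhex("0x1.400016p+3")     # a single just above 10
print("%.7e" % x, round_trips(x, 8))   # 1.0000010e+01 False
print("%.8e" % x, round_trips(x, 9))   # 1.00000105e+01 True
```

With 8 significant digits the text reads back as the neighbouring single; with 9 it reads back as the original.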

OTOH if you start with a textual decimal number with y significant digits, convert it to IEEE-754 single and then back to text, then the largest y you can count on always surviving is 6.

In C this is defined as FLT_DIG.
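
The opposite direction can be sketched the same way. This is an illustrative Python snippet, not C; `to_single` and `survives` are hypothetical helper names, and the two sample values were picked by hand to show one 6-digit success and one 7-digit failure.

```python
import struct

def to_single(x):
    """Round a Python float (a double) to the nearest IEEE-754 single."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def survives(text, digits):
    """Decimal text -> single -> decimal text with `digits` significant digits."""
    back = "%.*e" % (digits - 1, to_single(float(text)))
    return float(back) == float(text)

print(survives("8.58997e9", 6))    # True
print(survives("8.589973e9", 7))   # False: it comes back as 8.589974e+09
```

Near 8.59e9 adjacent singles are 1024 apart while adjacent 7-digit decimals are only 1000 apart, so some 7-digit inputs cannot come back unchanged.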


So in the end, I'd say d-10 = d-2 * log2 / log10 is almost right. But since powers of 2 (IEEE-754 single) and powers of 10 (x.xxxxxxxx * 10 ^ expo) do not match (except at 1.0), the precision to use with text is FLT_DECIMAL_DIG:

"number of decimal digits, n, such that any floating-point number with p radix b digits can be rounded to a floating-point number with n decimal digits and back again without change to the value,

p log10 b              if b is a power of 10
ceiling(1 + p log10 b) otherwise"

9 in the case of IEEE-754 single
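
The quoted definition is easy to evaluate directly. The sketch below is in Python with made-up function names; `decimal_digits_safe` uses the companion rule for FLT_DIG, floor((p - 1) log10 b) when b is not a power of 10.

```python
import math

def is_power_of_10(b):
    """True for 10, 100, 1000, ... (assumes an integer radix b > 1)."""
    while b > 1 and b % 10 == 0:
        b //= 10
    return b == 1

def decimal_digits_needed(p, b):
    """Digits so that p radix-b digits survive text and back (FLT_DECIMAL_DIG)."""
    if is_power_of_10(b):
        return round(p * math.log10(b))
    return math.ceil(1 + p * math.log10(b))

def decimal_digits_safe(p, b):
    """Decimal digits that survive conversion to p radix-b digits and back (FLT_DIG)."""
    if is_power_of_10(b):
        return round(p * math.log10(b))
    return math.floor((p - 1) * math.log10(b))

print(decimal_digits_needed(24, 2))    # 9
print(math.ceil(24 * math.log10(2)))   # 8, the rule from the question without the +1
print(decimal_digits_safe(24, 2))      # 6
```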

chux - Reinstate Monica
  • I'm a little confused by that link, but this is what my teacher said in class: there are 23 bits in the mantissa, and 1 hidden bit (1.mantissa). Using the calculation d-base_a = d-base_b * log(base_b) / log(base_a), I find that the number of sig figs required for base 10 is **8**. So when putting it into base 10 format I have 8 numbers including the (1.). My question is if by this logic, should that mean that the right side of my bolded answer should have the same sig figs (8), even though it's a number times a power of 10 rather than a power of 2? – user3251142 Feb 04 '14 at 19:06
  • The number of digits needed is `ceiling (1 + p log10 b) and (b not a power of 10)` where p is 24 and b is 2, which makes 9. The issue centers around where powers of 2 do not line up well with powers of 10, hence the +1. Need some more time to present a good example. – chux - Reinstate Monica Feb 04 '14 at 19:34
  • `1.000001049041748e+01 or 0x1.400016p+3` (a number just bigger than 10) converted to 1+7 digits is `"1.0000010e+01"`. Converting this text back to IEEE-754 single results in `1.000000953674316e+01 or 0x1.400014p+3`, which is the preceding representable IEEE-754 single. – chux - Reinstate Monica Feb 04 '14 at 19:47
  • @user3251142 Using 1+7 digits, between 10 and 16 there are 6*1,000,000 different representations, from 1.0000000 * 10^1 up to 1.6000000 * 10^1. But in IEEE-754 single, between 10 (0x1.4p+3) and just below 16 (0x1.fffffep+3) there are 3/4 * 2^23 or 6,291,456 different numbers. So printing IEEE-754 single numbers in this range needs 1 more decimal digit, as 6,000,000 textual representations are not enough for 6,291,456 numbers. – chux - Reinstate Monica Feb 04 '14 at 19:57
  • @user3251142 Again, the issue is that the powers of the two bases, 2 and 10, do not line up (except at 1.0). So the worst-case precision of one needs to be compared against the best case of the other. Hence the +1 to the equation going from 10 to 2. Had the bases lined up, like 1000 and 10, `d-base_b * log(base_b) / log(base_a)` would have worked. – chux - Reinstate Monica Feb 04 '14 at 20:02
  • I have one other off-topic question. How would you convert a big decimal number such as 3.4219087*10^12 to IEEE-754? Should I just write it out and keep dividing by 2 to see what the number is in binary, then convert it to a "1.x * 2^y" number, then find the exponent, and I'll have the full number? – user3251142 Feb 05 '14 at 17:32
  • @user3251142 For a `float` where 10^expo and -40 < expo < 40 then yes, simplest to /2 (or *2 if expo < 0). For wider exponents, where expo could be in the hundreds or thousands, a different approach should be used. In that case, the main idea is to convert "10^expo" (using an algorithm that depends on the expo's bit width) into `x * 2^expo2` and then multiply x by the original number's significand. This product, near 1.0, is then *2 the number of times needed for the desired precision. – chux - Reinstate Monica Feb 05 '14 at 17:41
  • I see. I have one more question. Should I simply truncate the unneeded digits (not part of the mantissa - 23 digits long) after converting it to a "1.x * 2^y", or what exactly? What if there is a 1 after the 23rd digit? – user3251142 Feb 05 '14 at 18:07
  • @user3251142 Truncate: no. Rounding should occur per the current rounding mode. If the mode is in doubt, [round to even](http://en.wikipedia.org/wiki/Round_to_even#Round_half_to_even). Note: with 20+ rep points you can enter chat rooms. – chux - Reinstate Monica Feb 05 '14 at 19:19
  • what do you mean by mode? and so am i right to say that "...1001|1100..." should end up being "...1010" – user3251142 Feb 05 '14 at 19:25
  • @user3251142 IEEE-754 computations are finely controlled by various mode settings. The rounding mode directs how to do rounding. Truncating (round toward 0) is one. Round-to-nearest with ties to even is another and IMO the most popular. Yes, "...1001|1100..." --> "...1010". The easiest way to do this is to look at "...100a|bcccc...ccc", where C is true if any c is true. Of the 8 combinations of a-b-C, round up (add 1 to the significand) if ((a and b) or (b and C)). Notice that this "add 1" could increment the exponent in the case of 1.1111...111|1. Further, the incremented exponent could overflow. – chux - Reinstate Monica Feb 05 '14 at 19:48
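
The a-b-C rule from the last comment can be sketched in Python as follows. Both function names are hypothetical. `to_single_bits` only handles whole numbers with 2**24 <= n < 2**127, so exponent overflow and subnormals are deliberately left out.

```python
import struct

def round_half_even(bits, drop):
    """Drop the low `drop` (>= 1) bits of a non-negative int, ties to even."""
    kept = bits >> drop
    a = kept & 1                                    # last bit that is kept
    b = (bits >> (drop - 1)) & 1                    # first bit that is dropped
    C = (bits & ((1 << (drop - 1)) - 1)) != 0       # any later dropped bit set?
    if (a and b) or (b and C):
        kept += 1                                   # may carry into a new top bit
    return kept

def to_single_bits(n):
    """IEEE-754 single bit pattern of an integer n, assuming 2**24 <= n < 2**127."""
    exponent = n.bit_length() - 1
    sig = round_half_even(n, n.bit_length() - 24)   # 24 bits, hidden 1 included
    if sig == 1 << 24:                              # 1.111...1|1 rounded up to 10.000...0
        sig >>= 1
        exponent += 1
    return ((exponent + 127) << 23) | (sig & 0x7FFFFF)

print(bin(round_half_even(0b1001_1100, 4)))         # 0b1010
print(bin(round_half_even(0b1_1111, 1)))            # 0b10000, the carry case

n = 3421908700000                                   # 3.4219087 * 10^12
print(hex(to_single_bits(n)))
# Cross-check against the platform's own double -> single conversion.
assert to_single_bits(n) == struct.unpack(">I", struct.pack(">f", float(n)))[0]
```

The first call reproduces "...1001|1100..." --> "...1010"; the second shows the rounded significand gaining a bit, which is the case where the exponent has to be incremented.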