
I am coding a little calculator in C for my exam preparation. I understand that double is more precise than float since it has 11 bits reserved for the exponent and 53 bits for the significand (52 stored plus an implicit leading bit). When it comes to integers, I can do the following to catch overflow and underflow:

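int sum(int a, int b, int *res){
    /* Compare against INT_MAX - b: computing INT_MAX + b would itself overflow when b > 0. */
    if((b > 0) && (a > INT_MAX - b)){
        return OVERFLOW_ERROR;
    }
    /* Likewise, INT_MIN - b cannot overflow when b < 0. */
    else if((b < 0) && (a < INT_MIN - b)){
        return UNDERFLOW_ERROR;
    }else {
        *res = a + b;
    }

    return (EXIT_SUCCESS);
}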

When it comes to double, if the number is too large, the console will give you "inf" or "-inf", which in any case isn't too bad. AFAIK, floating-point numbers "overflow" when they lose precision.

So, my question is, how do you handle the loss of precision? Can you make them "precise"? When do they lose precision?

Jordi
  • Making floating-point numbers always precise requires an infinite amount of RAM. – Martin James Feb 04 '21 at 19:13
  • Floating point calculations *always* "lose precision". – Eugene Sh. Feb 04 '21 at 19:14
  • One of these may help: https://www.google.com/search?q=floating+point+precision – jwdonahue Feb 04 '21 at 20:18
  • Precision refers to the number of bits in the significand—the fineness with which they can represent values. Accuracy is closeness to the ideal result. Your calculations may lose accuracy, but they do not lose precision unless you convert to a less precise format or your computer is broken or you do calculations near the edge of the exponent range so that low bits are below what is representable. – Eric Postpischil Feb 04 '21 at 20:18
  • Actually this article explicitly answers your questions: https://blog.demofox.org/2017/11/21/floating-point-precision/ – jwdonahue Feb 04 '21 at 20:20
  • Most floating-point algorithms are designed to tolerate some loss of accuracy, and most cannot avoid it. Exact calculations can be done with floating-point in limited situations with special care. This is not likely a course you want to pursue for casual use of floating-point. Also, hardware commonly allows enabling traps for floating-point exceptions, so you could enable traps for operations that produce inexact results. Software support for this is not always good. Even when it is available, enabling it may cause traps in other parts of your program. – Eric Postpischil Feb 04 '21 at 20:21
  • `INT_MAX + b` will always overflow if b is positive – phuclv Feb 05 '21 at 00:20
  • Few more links. See also: [Is floating point math broken?](http://stackoverflow.com/questions/588004/is-floating-point-math-broken) and [Why Are Floating Point Numbers Inaccurate?](https://stackoverflow.com/questions/21895756/why-are-floating-point-numbers-inaccurate) and [Floating point comparison `a != 0.7`](https://stackoverflow.com/questions/6883306/floating-point-comparison-a-0-7) – David C. Rankin Feb 05 '21 at 07:43

2 Answers


It's been a while since I looked at this properly, but it sounds like you're mixing up your terms - overflow (a numerical value becoming too large) is different to loss of precision (chopping off part of the significand).
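To make the distinction concrete, here is a minimal sketch of a double counterpart to the integer `sum` function from the question. The name `dsum` and the `ADD_*` status codes are hypothetical, chosen only for this illustration. It assumes IEEE 754 doubles and detects overflow after the fact, since a floating-point addition that overflows produces an infinity instead of undefined behavior.

```c
#include <math.h>

/* Hypothetical status codes for this sketch. */
enum { ADD_OK = 0, ADD_OVERFLOW = 1, ADD_INVALID = 2 };

/* Adds a and b; stores the result in *res only when it is usable. */
int dsum(double a, double b, double *res)
{
    double s = a + b;

    /* NaN results come from NaN inputs or from inf + (-inf). */
    if (isnan(s)) {
        return ADD_INVALID;
    }
    /* Two finite operands producing an infinity means the sum overflowed. */
    if (isinf(s) && isfinite(a) && isfinite(b)) {
        return ADD_OVERFLOW;
    }

    *res = s;
    return ADD_OK;
}
```

Note that this only catches overflow. A result that was merely rounded, such as `0.1 + 0.2`, still returns `ADD_OK`, which is exactly the difference between overflow and loss of accuracy.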

IIRC, loss of precision happens either when converting to a shorter floating-point format or when floating-point numbers become subnormal (denormalized), so if you really want the greatest precision possible, use long double (or see if your compiler supports a wider floating-point format) and check for subnormal numbers at each stage of a calculation. You can't make any floating-point number or calculation "absolutely precise" unless you know you're only dealing with numbers that can be represented exactly (e.g. 0.5, 0.25, 0.125, etc.) and don't do crazy things like adding two numbers of wildly different magnitudes together.
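Both situations can be checked in code. The following is a rough sketch, assuming IEEE 754 doubles; the helper names `is_absorbed` and `is_subnormal` are made up for this example, while `fpclassify` and `FP_SUBNORMAL` are standard C99 from `<math.h>`.

```c
#include <math.h>
#include <stdbool.h>

/* Hypothetical helper: true if adding a nonzero b to a changes nothing,
   i.e. b is too small relative to a to survive the rounding. */
bool is_absorbed(double a, double b)
{
    /* volatile forces the sum to be rounded to double before comparing. */
    volatile double s = a + b;
    return b != 0.0 && s == a;
}

/* Hypothetical helper: true if x is subnormal, meaning it has fewer
   significant bits than a normal double. */
bool is_subnormal(double x)
{
    return fpclassify(x) == FP_SUBNORMAL;
}
```

For instance, `is_absorbed(1e16, 1.0)` is true because neighboring doubles near 1e16 are 2 apart, and `is_subnormal(1e-310)` is true because 1e-310 lies below the smallest normal double (about 2.2e-308).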

Generally, dealing with these sorts of numerical errors is pretty involved, and specific to the calculation being done - e.g. you might re-arrange an equation so that you avoid subtracting two numbers that are very close to each other in value, so you don't lose significance.
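One classic illustration of such a rearrangement is computing x² − y² as (x − y)(x + y). This is a sketch of the general idea, not a recipe for every calculation, and the function names are invented for the example.

```c
/* Naive form: both squares are rounded first, and the subtraction of two
   nearly equal values then exposes those rounding errors. */
double diff_of_squares_naive(double x, double y)
{
    double xx = x * x;
    double yy = y * y;
    return xx - yy;
}

/* Rearranged form: x - y and x + y are computed from the original inputs,
   so no already-rounded large values are subtracted from each other. */
double diff_of_squares_factored(double x, double y)
{
    return (x - y) * (x + y);
}
```

With x = 100000001.0 and y = 100000000.0 the exact answer is 200000001. The factored form returns exactly that, whereas the naive form typically returns 200000000 on IEEE 754 doubles, because x² does not fit in 53 bits (a compiler that fuses the multiply and subtract may hide the error).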

If you've not come across it, "What Every Computer Scientist Should Know About Floating-Point Arithmetic" is a fantastic free article, and I can highly recommend "Numerical Computing with IEEE Floating Point Arithmetic" for a good read.

John Graham

I recommend using libgmp.a or a similar library if you want more precision in your calculations. I cannot imagine what environment you would need it in, apart from cryptography or computing more and more decimals of pi, but there are libraries that let you go beyond the native precision of the machine.

There's an example in Free42, an emulator of the HP-42S pocket calculator (also used by SwissMicros in their range of pocket calculators): it uses 128-bit decimal floating-point numbers, giving a precision of 34 decimal digits.

But the gain in precision has a cost (though not one that matters for a simple calculator): there are no longer machine instructions to multiply two floating-point numbers, so each basic operation must be carried out in software, which slows down the overall calculation.

Luis Colorado