4

I need help and sample code on how to detect floating point underflow in standard way with out using third party libraries and exceptions for both signed and unsigned. I googled and found that various authors are talking about "gradual underflow counts as underflow or not" ? Can any one explain me what gradual underflow? I need to check both scenarios

Thanks for your time and help

venkysmarty
  • 11,099
  • 25
  • 101
  • 184
  • @ Mystical Request to provide a sample code how we can do this – venkysmarty Mar 28 '13 at 05:54
  • As numbers approach 0 the precision will start decreasing dramatically too - do you care about that, or just the 0 case? – Tony Delroy Mar 28 '13 at 05:55
  • If denormalized numbers are implemented then underflow isn't something you really need to worry about. – Patashu Mar 28 '13 at 05:55
  • 3
    [What Every Computer Scientist Should Know About Floating-Point Arithmetic](http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html) is a must read. It covers `gradual underflow` as well. – Alexey Frunze Mar 28 '13 at 05:59
  • Yes, underflow will still get signaled if you underflow to denormalized numbers. – Patashu Mar 28 '13 at 05:59
  • 3
    Don't create duplicate questions. – Alexey Frunze Mar 28 '13 at 05:59
  • This is NOT a duplicate question. This is specifically about C. The linked "duplicate" is specifically about C++. – SO Stinks Aug 14 '21 at 04:16
  • @sos This question is tagged both [tag:c] and [tag:c++]. The duplicate offers solutions in C and C++. Whether this qualifies as a duplicate is hard to say. After all, this question is really two questions. – IInspectable Oct 08 '21 at 16:50

4 Answers4

3

A look at the top search results for “gradual underflow” does not show a clear and direct answer, so:

IEEE-754 binary floating-point numbers have a regular pattern through most of their range: There is a sign, an exponent, and a significand with a set number of bits (24 for 32-bit float, 53 for 64-bit double). However, the pattern must be interrupted at the ends. At the high end, results too large for the largest exponent are changed to infinity. At the low end, we have a choice.

One choice would be that if the result is lower than the lowest exponent, the result is rounded to zero. However, IEEE-754 uses a different scheme called gradual underflow. The lowest exponent is reserved for a different format than is used for regular exponents.

With a normal exponent, the 24-bit significand is “1.” followed by 23 bits that are encoded in the significand field. When the number is subnormal, the exponent has the same value as the lowest regular exponent, but the 24-bit significand is “0.” followed by 23 bits. This is gradual underflow because, as numbers get smaller, they have less and less precision (more of the leading bits in the significand are zero) before we reach zero.

Gradual underflow has some nice mathematical properties, notably that a-b == 0 if and only if a == b. With sudden underflow, it would be possible that that a-b == 0 even if a and b are different, because a-b is too small to be represented in the floating-point format. With gradual overflow, all possible values of a-b, for small a and b, are representable because they are just differences in that significand with the lowest exponent.

Another issue in determining whether floating-point underflow has occurred is that implementations are permitted (by the IEEE-754 standard) to report underflow based on a test either before or after rounding. When calculating a result, a floating-point implementation effectively has to do these steps:

  • Calculate the sign, exponent, and significand of the exact result.
  • Round the result to fit within the floating-point format. If the significand rounds up, this may increase the exponent.

The standard allows the implementation to report underflow with either:

  • Calculate the sign, exponent, and significand of the exact result.
  • Is the exponent smaller than the normal range? If so, report underflow.
  • Round the result to fit within the floating-point format. If the significand rounds up, this may increase the exponent.

or:

  • Calculate the sign, exponent, and significand of the exact result.
  • Round the result to fit within the floating-point format. If the significand rounds up, this may increase the exponent.
  • Is the exponent smaller than the normal range? If so, report underflow.

Thus, two different implementations of floating-point may return different reports about underflow for the same calculation.

(There are some additional rules about handling underflow. The above causes an underflow exception to be signaled. However, if traps from this exception are not enabled and the result is exact [rounding did not change anything], then the “underflow” is ignored, and the underflow flag is not raised. If the result is inexact, then the underflow is raised and an inexact exception is signaled.)

Eric Postpischil
  • 195,579
  • 13
  • 168
  • 312
1

Check the floating point status word for the 'underflow' bit, or trap the underflow floating point exception's signal. See:

http://www.gnu.org/software/libc/manual/html_node/FP-Exceptions.html

Patashu
  • 21,443
  • 3
  • 45
  • 53
0
template <typename T>
class Real
{
  public:
    Real(T x) : x_(x) { }
    Real& operator/=(T rhs)
    {
        if (x_)
        {
            x_ /= rhs;
            if (!x_)
                do whatever you want for underflow...
        }
    }
    friend Real operator/(Real lhs, Real rhs)
        { return lhs /= rhs; }
    // similar for -= etc.
  private:
    T x_;
}
Tony Delroy
  • 102,968
  • 15
  • 177
  • 252
0

First you need to enable the generation of floating point exceptions in your build. Second, you have to catch them in your code.

I did something like this (using Visual Studio)

    void translateFPException( unsigned int u, EXCEPTION_POINTERS* pExp )
    {
    unsigned int fpuStatus = _clearfp(); // clear the exception 
    switch (u)
    {
    case STATUS_FLOAT_DENORMAL_OPERAND: 
        throw fe_denormal_operand(fpuStatus);
    case STATUS_FLOAT_DIVIDE_BY_ZERO: 
        throw fe_divide_by_zero(fpuStatus);
    case STATUS_FLOAT_INEXACT_RESULT: 
        throw fe_inexact_result(fpuStatus); 
    case STATUS_FLOAT_INVALID_OPERATION: 
        throw fe_invalid_operation(fpuStatus);  
    case STATUS_FLOAT_OVERFLOW: 
        throw fe_overflow(fpuStatus);   
    case STATUS_FLOAT_UNDERFLOW: 
        throw fe_underflow(fpuStatus);  
    case STATUS_FLOAT_STACK_CHECK: 
        throw fe_stack_check(fpuStatus);    
    default:
        throw float_exception(fpuStatus);
    };
    }

void initializeFloatingPointExceptionHandling()
{
    unsigned int fpControlWord = 0x00;

    _clearfp(); // always call _clearfp before enabling/unmasking a FPU exception
    // enabling an exception is done by clearing the respective bit in the control word
    errno_t success = _controlfp_s( &fpControlWord, 
                                    0xffffffff^( _EM_INVALID 
                                               | _EM_ZERODIVIDE 
                                               | _EM_OVERFLOW
                                               | _EM_DENORMAL
                                               | _EM_UNDERFLOW
                                               | _EM_INEXACT), _MCW_EM );
    if (success != 0)
    {
        stringstream errStream;
        errStream << success << " " << __FILE__ << ":" << __LINE__ << std::endl;
        throw exception(errStream.str().c_str());
    }

    _se_translator_function old =_set_se_translator(translateFPException);
}
void enableAllFPUExceptions()
{
    unsigned int oldMask = 0x00000000;

    _clearfp();
    // enabling is done by clearing the respective bit in the mask
    _controlfp_s(&oldMask, 0x00, _MCW_EM);
}

void disableAllFPUExceptions()
{
    unsigned int oldMask = 0x00000000;

    _clearfp();
    // disabling is done by setting the respective bit in the mask
    _controlfp_s(&oldMask, 0xffffffff, _MCW_EM);
}

You then have to write your own exceptions (also just an excerpt, but you should get the concept)

    class float_exception : public exception
{
public:
    float_exception(unsigned int fpuStatus);

    unsigned int getStatus();
protected:
    unsigned int fpuStatus;
};

class fe_denormal_operand : public float_exception
{
public:
    fe_denormal_operand(unsigned int fpuStatus);
};
ogni42
  • 1,232
  • 7
  • 11