Reliable overflow detection of floating-point/integer type conversion

Question

Is there a safe way to reliably determine if an integral type T can store a floating-point integer value f (so f == floor(f)) without any overflow?

Keep in mind that there is no guarantee that the floating point type F is IEC 559 (IEEE 754) compatible, and that signed integer overflow is undefined behavior in C++. I'm interested in a solution which is correct according to the current C++ (C++17 at the writing) standard and avoids undefined behavior.

The following naive approach is not reliable, since there is no guarantee that type F can represent std::numeric_limits<I>::max() due to floating-point rounding.

#include <cmath>
#include <limits>
#include <type_traits>

template <typename I, typename F>
bool is_safe_conversion(F x)
{
    static_assert(std::is_floating_point_v<F>);
    static_assert(std::is_integral_v<I>);

    // 'fmax' may have a different value than expected
    static constexpr F fmax = static_cast<F>(std::numeric_limits<I>::max());

    return std::abs(x) <= fmax; // this test may gives incorrect results
}

Any idea?

[This was answered for C](https://stackoverflow.com/questions/51104995/can-a-conversion-from-double-to-int-be-written-in-portable-c/51107035#51107035), and the solution there should serve for C++ as well. The essential approach serves in C++: Use the characteristics of the floating-point type to safely find the greatest representable floating-point value less than `INT_MAX`+1 and the least value greater than `INT_MIN`−1, and then floating-piont values can be directly compared to those two bounds. — Eric Postpischil, Jul 12 '18 at 12:08
@EricPostpischil But C++ might yet allow other approaches not applicable to C... — Aconcagua, Jul 12 '18 at 12:11
@EricPostpischil Still, the answer there is good - why don't you post an answer like "The same problem was solved in C [link] already, the solution is applicable in C++ as well."? — Aconcagua, Jul 12 '18 at 12:26
@Aconcagua: I would, but it ought to be modified for C++ things, like including `` instead of ``, and I do not have time right now—I am about to go on a road trip for the day. Feel free to copy and edit it, with credit. Otherwise, I might get to it in the next few days. — Eric Postpischil, Jul 12 '18 at 12:28
I'd say it's OK just mentioning those changes required to the old answer in the new one. Additionally, there is quite a lot of C API outside being used from C++ which doesn't need to be changed to work either... — Aconcagua, Jul 12 '18 at 12:47
@Aconcagua: I updated my C answer for C++ and entered it. There are some minor changes. One is the requirement that `FLT_RADIX` be representable in `int` is gone—C++ guarantees that because `std::numeric_limits<*type*>::radix` is an `int`. — Eric Postpischil, Jul 13 '18 at 11:24

score 3 · Accepted Answer · edited Jul 13 '18 at 12:36

3

Is there a safe way to reliably determine if an integral type T can store a floating-point integer value f?

Yes. The key is to test if f is in the range T::MIN - 0.999... to T::MAX + 0.999... using floating point math - with no rounding issues. Bonus: rounding mode does not apply.

There are 3 failure paths: too big, too small, not-a-number.

The below assumes int/double. I'll leave the C++ template forming for OP.

Forming exact T::MAX + 1 exactly using floating point math is easy as INT_MAX is a Mersenne Number. (We are not talking about Mersenne Prime here.)

Code takes advantage of:
A Mersenne Number divided by 2 with integer math is also a Mersenne Number.
The conversion of a integer type power-of-2 constant to a floating point type can be certain to be exact.

#define DBL_INT_MAXP1 (2.0*(INT_MAX/2+1)) 
// Below needed when -INT_MAX == INT_MIN
#define DBL_INT_MINM1 (2.0*(INT_MIN/2-1))

Forming exact T::MIN - 1 is hard as its absolute value is usually a power-of-2 + 1 and the relative precision of the integer type and the FP type are not certain. Instead code can subtract the exact power of 2 and compare to -1.

int double_to_int(double x) {
  if (x < DBL_INT_MAXP1) {
    #if -INT_MAX == INT_MIN
    // rare non-2's complement machine 
    if (x > DBL_INT_MINM1) {
      return (int) x;
    }
    #else
    if (x - INT_MIN > -1.0) {
      return (int) x;
    }
    #endif 
    Handle_Underflow();
  } else if (x > 0) {
    Handle_Overflow();
  } else {
    Handle_NaN();
  }
}

Regarding floating-point types with non-binary radix (FLT_RADIX != 2)

With FLT_RADIX = 4, 8, 16 ..., the conversion would be exact too. With FLT_RADIX == 10, code is at least exact up to a 34-bit int as a double must encode +/-10^10 exactly. So a problem with say a FLT_RADIX == 10, 64-bit int machine - a low risk. Based on memory, the last FLT_RADIX == 10 in production was over a decade ago.

The integer type is always encoded as 2's complement (most common), 1s' complement, or sign magnitude. INT_MAX is always a power-2-minus-1. INT_MIN is always a - power-2 or 1 more. Effectively, always base 2.

edited Jul 13 '18 at 12:36

Aconcagua

24,880
4
34
59

answered Jul 12 '18 at 15:58

chux - Reinstate Monica

143,097
13
135
256

Nice approach, but you assume a base 2 floating-point number. What if the radix of the floating-point type and the integer type is different? – plasmacel Jul 12 '18 at 17:46
@plasmacel With `FLT_RADIX = 4, 8, 16 ....`, the conversion would be exact too. With `FLT_RADIX == 10`, code is _at least_ exact up to a 34-bit `int` as a `double` must encode +/-10^10 exactly. So a problem with say a `FLT_RADIX == 10`, 64-bit `int` machine - a low risk. Based on memory, the last `FLT_RADIX == 10` in production was over a decade ago. – chux - Reinstate Monica Jul 12 '18 at 18:00
@plasmacel The integer type is always encoded as 2's complement (most common), 1s' complement, or sign magnitude. `INT_MAX` is always a power-2-minus-1. `INT_MIN` is always a `-` power-2 or 1 more. Effectively - always base 2. – chux - Reinstate Monica Jul 12 '18 at 18:04
Two's complement representation can be detected as `-1 == ~0`. Do you know how to differentiate one's complement from sign magnitude representation? – plasmacel Jul 12 '18 at 23:18
@plasmacel `-1 & 1 == 0` with 1s' complement. – chux - Reinstate Monica Jul 12 '18 at 23:53
1

@plasmacel `~0` is a potential trap representation on non-2's complement. `-1 == ~0` may trap. – chux - Reinstate Monica Jul 12 '18 at 23:55
Various requirements or limitations of this code should be explicitly stated. `(2.0*(INT_MAX/2+1)` could overflow in a floating-point format without infinity. It could also round down to the greatest finite floating-point value if the implementation uses round-toward-zero or round-toward-minus-infinity instead of round-to-nearest (or uses non-IEEE-754 rules for rounding). – Eric Postpischil Jul 13 '18 at 02:39
@EricPostpischil `(2.0*(INT_MAX/2+1)` could overflow if `int` was 123-bit or more - or are you thinking of some other situation? Aside from such wide `int`, there is no rounding. With IEEE-754 64-bit `double`, there is no inexactness or rounding unless `int` was more than 1024 bits. Perhaps an example you have would shed light on your concern. – chux - Reinstate Monica Jul 13 '18 at 03:28
@EricPostpischil is right about that it would be good if limitations would be explicitly stated. I'm interested in the limits, especially in how to determine the limits in case of different floating-point and `int` types (with arbitrary widths). – plasmacel Jul 13 '18 at 07:48
@chux `2.0*(INT_MAX/2+1)` can overflow if the mathematical result greater than the greatest value representable in the floating-point format, which can occur either because the `int` format (and hence `INT_MAX`) is wide or because the floating-point exponent is fairly limited. It is a limitation of the proposed solution and ought to be stated explicitly. (And, of course, there is no need for this limit, as the code I provided demonstrates that the desired value can be computed regardless of whether `INT_MAX` is bigger or smaller than the maximum representable floating-ponit value.) – Eric Postpischil Jul 14 '18 at 16:00
@chux: Note that the question states that IEEE-754 is not guaranteed, so using the properties of IEEE-754 64-bit `double` is outside the question that was asked. Additionally, the questions asks about generic floating-point and integer types; none of `float`, `double`, or `int` is mentioned. That is why I wrote my code using types parameterized with typedef. It will accept any floating-point and integer types, including converting `float` to `int128_t` or `long double` to `char`, and the only floating-point operations used have exact results in all cases. – Eric Postpischil Jul 14 '18 at 16:30
@EricPostpischil `2.0*(INT_MAX/2+1)` can overflow if the mathematical result greater than the greatest value representable in the floating-point format. Yet consider what it takes for that to occur to create a negative effect: 1) the integer type exceed `int123_t` 2) the _promoted_ FP type only meets the minimal specification of `1e37`. I will grant you this point if one compliant compiler presently exists that uses such a small range `double`. Else we are only in the realm of theoretical/unicorn machines. – chux - Reinstate Monica Jul 14 '18 at 23:46
@EricPostpischil Concerning the solution you have provided, it too has concerns with various unicorn FP types that employ double-double for a `long double` as the numeric_limits of that refer to normalized values and not the denormal values where a finite value can exceed the `::max()`. `FPTypeToInt()` need for a prior `InitializeFPTypeToInt();` (something ought to be stated explicitly) should be built into that function, rather than depend of OP's usage. – chux - Reinstate Monica Jul 14 '18 at 23:56
@chux: (a) Your answer fails to state a requirement. Arguing about what circumstances that requirement is or is not violated are irrelevant to that. As long as you do not edit the answer to state the requirement, it is defective. (b) C requires that `float` support up to 1e37 (and it is not clear to me that C++ does; it is not immediately obvious, and I have not traced the various statements in the specification to such a requirement), but OP does not ask about `float`. They ask about any floating-point type. That could be `__fp16` or some other type that is an extension to the required types. – Eric Postpischil Jul 15 '18 at 00:00
@EricPostpischil This answer does not assume IEEE-754 and does not use the properties of that standard. It does reflect compilers that adhere to the C standard. For further discourse I suggest we chat over any lingering issues of concern. – chux - Reinstate Monica Jul 15 '18 at 00:01
@chux: I do not see how you mean that “denormal” double-double values can exceed `::max()`. Whatever representation the implementation uses for a floating-point type, it ought to set `::max()` to the greatest representable finite value. – Eric Postpischil Jul 15 '18 at 00:04
@chux: The initialization for `FPTypeToInt` can be accomplished by wrapping everything in a class. I have considered this but have not had time to do it. It is irrelevant to the computation and C++ specification issues. – Eric Postpischil Jul 15 '18 at 00:06
@chux: You stated earlier “With IEEE-754 64-bit `double`,…” Now you state “This answer does not assume IEEE-754…” It is contradictory to defend the answer based on the properties of IEEE-754 and to assert the answer does not assume IEEE-754. If the answer does not assume IEEE-754, then the previous comment deriving a conclusion from IEEE-754 is moot. – Eric Postpischil Jul 15 '18 at 00:12
I added automatic initialization to the code in my answer, in the process converting it into a templated class. – Eric Postpischil Jul 15 '18 at 01:06
Can `x - INT_MIN > -1.0` be safely substituted with `x + 1.0 > INT_MIN`? – Emile Cormier May 02 '21 at 00:38
I checked using `x + 1.0 > INT_MIN` at the corner case of `-2147483776.0f` for `float` -> `int`, and I get the same result as `x - INT_MIN > -1.0`. https://godbolt.org/z/Y1frfejz1 – Emile Cormier May 02 '21 at 01:03
The reason I ask is that I'm dealing with a rudimentary 128-bit IEEE floating point number holder with limited arithmetic capabilities. Adding 1.0 is easier to emulate that subtracting an arbitrary value. – Emile Cormier May 02 '21 at 01:06
1

@EmileCormier "Can x - INT_MIN > -1.0 be safely substituted with x + 1.0 > INT_MIN?" --> No, not when `x` has fewer significant digits than `INT_MIN`, else Yes. Your [sample](https://godbolt.org/z/Y1frfejz1) code still uses `double` addition with `1.0` instead of `1.0f`. Adding 1.0 is _easier_, but incorrect in edge cases any time `x + 1.0` is not _exact_. `x - INT_MIN > -1.0` is always correct with 2's compliment as `x - INT_MIN` is always exact when `x` is near `INT_MIN`. – chux - Reinstate Monica May 02 '21 at 21:52
@chux Good catch on the `1.0` vs `1.0f` in my sample code. Thanks! – Emile Cormier May 03 '21 at 00:13

Aconcagua · Answer 2 · 2018-07-12T12:58:48.693

Any idea?

template <typename I, typename F>
constexpr F maxConvertible()
{
    I i = std::numeric_limits<I>::max();
    F f = F(i);
    while(F(i) == f)
    {
        --i;
    }
    return F(i);
}

Due to rounding, we might have got a too large maximum, now downcounting until we get the next representable double being smaller, which should fit into the integral...

Problem left open: This works fine, if conversion to double involves up-rounding; however, even IEEE 754 allows different rounding modes (if rounding to nearest is applied, which should be the most common rounding mode across current hardware, up-rounding will always occur...).

I have not spotted a solution to safely detect down-rounding yet (might add later; at least detecting "rounding to nearest" has already a solution here), if this occurs, we get some negative falses near the maxima and minima of the integral values, you might consider this "acceptable" for those few exotic architectures actually doing down-rounding.

Independent from up- or down-rounding, there is a special case for signed integrals anyway: Provided the integral number is represented in two's complement and has more bits than the mantissa of the floating point value, then the types minimum value will be representable as floating point value whereas some greater values will not. Catching this case requires special treatment.

Eric Postpischil · Answer 3 · 2018-07-15T01:04:32.613

This approach uses the definition of floating-point formats in the C (not C++, see first comment) standard. Knowing the number of digits in the significand (provided by numeric_limits::digits) and the exponent limit (provided by numeric_limits::max_exponent) allows us to prepare exact values as end points.

I believe it will work in all conforming C++ implementations subject to the modest additional requirements stated in the initial comment. It supports floating-point formats with or without infinities, with ranges wider or narrower than the destination integer format, and with any rounding rules (because it uses only floating-point arithmetic with exactly representable results, so rounding should never be needed).

/*  This code demonstrates safe conversion of floating-point to integer in
    which the input floating-point value is converted to integer if and only if
    it is in the supported domain for such conversions (the open interval
    (Min-1, Max+1), where Min and Max are the mininum and maximum values
    representable in the integer type).  If the input is not in range, an error
    throw and no conversion is performed.  This throw can be replaced by any
    desired error-indication mechanism so that all behavior is defined.

    There are a few requirements not fully covered by the C++ standard.  They
    should be uncontroversial and supported by all reasonable C++
    implementations:

        The floating-point format is as described in C 2011 5.2.4.2.2 (modeled
        by the product of a sign, a number of digits in some base b, and base b
        raised to an exponent).  I do not see this explicitly specified in the
        C++ standard, but it is implied by the characteristics specified in
        std::numeric_limits.  (For example, C++ requires numeric_limits to
        provide the number of base-b digits in the floating-point
        representation, where b is the radix used, which means the
        representation must have base-b digits.)

        The following operations are exact in floating-point.  (All of them
        are elementary operations and have mathematical results that are
        exactly representable, so there is no need for rounding, and hence
        exact results are expected in any sane implementation.)

            Dividing by the radix of the floating-point format, within its
            range.

            Multiplying by +1 or -1.

            Adding or subtracting two values whose sum or difference is
            representable.

        std::numeric_limits<FPType>::min_exponent is not greater than
        -std::numeric_limits<FPType>::digits.  (The code can be modified to
        eliminate this requirement.)
*/


#include <iostream> //  Not needed except for demonstration.
#include <limits>


/*  Define a class to support safe floating-point to integer conversions.

    This sample code throws an exception when a source floating-point value is
    not in the domain for which a correct integer result can be produced, but
    the throw can be replaced with any desired code, such as returning an error
    indication in an auxiliary object.  (For example, one could return a pair
    consisting of a success/error status and the destination value, if
    successful.)

    FPType is the source floating-point type.
    IType is the destination integer type.
*/
template<typename FPType, typename IType> class FPToInteger
{
private:

    /*  Wrap the bounds we need in a static object so it can be easily
        initialized just once for the entire program.
    */
    static class StaticData
    {
    private:

        /*  This function helps us find the FPType values just inside the
            interval (Min-1, Max+1), where Min and Max are the mininum and
            maximum values representable in the integer type).

            It returns the FPType of the same sign of x+s that has the greatest
            magnitude less than x+s, where s is -1 or +1 according to whether x
            is non-positive or positive.
        */
        static FPType BiggestFPType(IType x)
        {
            /*  All references to "digits" in this routine refer to digits in
                base std::numeric_limits<FPType>::radix.  For example, in base
                3, 77 would have four digits (2212).  Zero is considered to
                have zero digits.

                In this routine, "bigger" and "smaller" refer to magnitude.  (3
                is greater than -4, but -4 is bigger than 3.) */

            //  Abbreviate std::numeric_limits<FPType>::radix.
            const int Radix = std::numeric_limits<FPType>::radix;

            //  Determine the sign.
            int s = 0 < x ? +1 : -1;

            //  Count how many digits x has.
            IType digits = 0;
            for (IType t = x; t; ++digits)
                t /= Radix;

            /*  If the FPType type cannot represent finite numbers this big,
                return the biggest finite number it can hold, with the desired
                sign.
            */
            if (std::numeric_limits<FPType>::max_exponent < digits)
                return s * std::numeric_limits<FPType>::max();

            //  Determine whether x is exactly representable in FPType.
            if (std::numeric_limits<FPType>::digits < digits)
            {
                /*  x is not representable, so we will return the next lower
                    representable value by removing just as many low digits as
                    necessary.  Note that x+s might be representable, but we
                    want to return the biggest FPType less than it, which, in
                    this case, is also the biggest FPType less than x.
                */

                /*  Figure out how many digits we have to remove to leave at
                    most std::numeric_limits<FPType>::digits digits.
                */
                digits = digits - std::numeric_limits<FPType>::digits;

                //  Calculate Radix to the power of digits.
                IType t = 1;
                while (digits--) t *= Radix;

                return x / t * t;
            }
            else
            {
                /*  x is representable.  To return the biggest FPType smaller
                    than x+s, we will fill the remaining digits with Radix-1.
                */

                //  Figure out how many additional digits FPType can hold.
                digits = std::numeric_limits<FPType>::digits - digits;

                /*  Put a 1 in the lowest available digit, then subtract from 1
                    to set each digit to Radix-1.  (For example, 1 - .001 =
                    .999.)
                */
                FPType t = 1;
                while (digits--) t /= Radix;
                t = 1-t;

                //  Return the biggest FPType smaller than x+s.
                return x + s*t;
            }
        }

    public:

        /*  These values will be initialized to the greatest FPType value less
            than std::numeric_limits<IType>::max()+1 and the least FPType value
            greater than std::numeric_limits<IType>::min()-1.
        */
        const FPType UpperBound, LowerBound;

        //  Constructor to initialize supporting data for FPTypeToInteger.
        StaticData()
            : UpperBound(BiggestFPType(std::numeric_limits<IType>::max())),
              LowerBound(BiggestFPType(std::numeric_limits<IType>::min()))
        {
            //  Show values, just for illustration.
            std::cout.precision(99);
            std::cout << "UpperBound = " << UpperBound << ".\n";
            std::cout << "LowerBound = " << LowerBound << ".\n";
        }

    } Data;


public:


    FPType value;


    //  Constructor.  Just remember the source value.
    FPToInteger(FPType x) : value(x) {}


    /*  Perform the conversion.  If the conversion is defined, return the
        converted value.  Otherwise, throw an exception.
    */
    operator IType()
    {
        if (Data.LowerBound <= value && value <= Data.UpperBound)
            return value;
        else
            throw "Error, source floating-point value is out of range.";
    }
};


template<typename FPType, typename IType>
    typename FPToInteger<FPType, IType>::StaticData
        FPToInteger<FPType, IType>::Data;


typedef double FPType;
typedef int    IType;


//  Show what the class does with a requested value.
static void Test(FPType x)
{
    try
    {
        IType y = FPToInteger<FPType, IType>(x);
        std::cout << x << " -> " << y << ".\n";
    }
    catch (...)
    {
        std::cout << x << " is not in the domain.\n";
    }
}


#include <cmath>


int main(void)
{
    std::cout.precision(99);

    //  Simple demonstration (not robust testing).
    Test(0);
    Test(0x1p31);
    Test(std::nexttoward(0x1p31, 0));
    Test(-0x1p31-1);
    Test(std::nexttoward(-0x1p31-1, 0));
}

score -1 · Answer 4 · answered Jul 12 '18 at 11:45

-1

Can you not just do

static_cast<F>(static_cast<I>(x)) == floor(x)

?

answered Jul 12 '18 at 11:45

user31601

2,482
1
12
22

First, this should not be an answer, but a comment. And no. If `I` is a signed integer type, then signed integer overflow (which can happen in `static_cast(x)`) is undefined behavior. There is no guarantee, that values of `I` will wrap around like unsigned integer values do. – plasmacel Jul 12 '18 at 11:48
I don't see why the specifics of overflow behaviour are relevant here? We're not interested in _how_ it overflows, just _whether_. If the integral type can't store the floating-point value, then casting to integral and back will surely change the value. – user31601 Jul 12 '18 at 11:54
Literally, undefined behavior can format your hard drive. :] While this behavior has a low probability, the compiler can freely implement signed integer overflow as a runtime error (trap). – plasmacel Jul 12 '18 at 11:55
I expect that, regardless of whet the spec says, a compiler that did that for signed integral overflow would have a bug raised against it pretty quickly. – user31601 Jul 12 '18 at 11:57
The question is clear. This solution doesn't fulfill the requirements of it, since the question asks for a solution which is correct according to the current C++ standard (not a specific compiler implementation, which has raised a bug for a non-conventional behavior or not). – plasmacel Jul 12 '18 at 11:59
4

@user31601: Because integer overflow is undefined, a compiler is free to recognize that `static_cast(static_cast(x))` produces `floor(x)` for all values that do not overflow and to decide that, for the sake of optimization, it may also produce `floor(x)` for values that do overflow. Then the expression `static_cast(static_cast(x)) == floor(x)` is always true, and the compiler would compile it to a hard-coded true. – Eric Postpischil Jul 12 '18 at 12:15
Thanks @EricPostpischil, I understand now. – user31601 Jul 12 '18 at 13:01

Reliable overflow detection of floating-point/integer type conversion

4 Answers4

Linked

Related