This topic has come up many times on StackOverflow, but I believe this is a new take. Yes, I have read Bruce Dawson's articles and What Every Computer Scientist Should Know About Floating-Point Arithmetic and this nice answer.

As I understand it, on a typical system there are four basic problems when comparing floating-point numbers for equality (the first two are illustrated just after this list):

  1. Floating point calculations are not exact
  2. Whether a-b is "small" depends on the scale of a and b
  3. Whether a-b is "small" depends on the type of a and b (e.g. float, double, long double)
  4. Floating point typically has +-infinity, NaN, and denormalized representations, any of which can interfere with a naïve formulation
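For instance, here is a minimal illustration of the first two problems (my own example, assuming IEEE 754 double):

#include <iostream>

int main()
{
    std::cout << std::boolalpha;
    std::cout << (0.1 + 0.2 == 0.3) << "\n";   // false: the decimal constants and the sum are not exact
    std::cout << (1e16 + 1.0 == 1e16) << "\n"; // true: at this magnitude 1 ULP is 2, so adding 1 changes nothing
}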

This answer -- a.k.a. "the Google approach" -- seems to be popular. It does handle all of the tricky cases. And it does scale the comparison very precisely, checking whether two values are within a fixed number of ULPs of each other. Thus, for example, a very large number compares "almost equal" to infinity.
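To make the ULP idea concrete, here is a minimal, deliberately non-portable sketch of an ULP distance for float (my own rendering, using memcpy rather than the union, with the same sign-magnitude-to-biased mapping the Google code uses; NaN handling omitted):

#include <cstdint>
#include <cstring>

// Sketch only: assumes float is 32-bit IEEE 754 and shares endianness with uint32_t.
std::uint32_t ulpDistance(const float a, const float b)
{
    std::uint32_t ia, ib;
    std::memcpy(&ia, &a, sizeof ia);
    std::memcpy(&ib, &b, sizeof ib);
    // Map sign-magnitude bit patterns onto a monotonically ordered ("biased") range.
    ia = (ia & 0x80000000u) ? ~ia + 1 : ia | 0x80000000u;
    ib = (ib & 0x80000000u) ? ~ib + 1 : ib | 0x80000000u;
    return ia >= ib ? ia - ib : ib - ia;
}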

However:

  • It is very messy, in my opinion.
  • It is not particularly portable, relying heavily on internal representations, using a union to read the bits from a float, etc.
  • It only handles single-precision and double-precision IEEE 754 (in particular, no x86 long double)

I want something similar, but using standard C++ and handling long doubles. By "standard", I mean C++03 if possible and C++11 if necessary.

Here is my attempt.

#include <cmath>
#include <limits>
#include <algorithm>

namespace {
// Local version of frexp() that handles infinities specially.
template<typename T>
T my_frexp(const T num, int *exp)
{
    typedef std::numeric_limits<T> limits;

    // Treat +-infinity as +-(2^max_exponent).
    if (std::abs(num) > limits::max())
    {
        *exp = limits::max_exponent + 1;
        return std::copysign(0.5, num);
    }
    else return std::frexp(num, exp);
}
}

template<typename T>
bool almostEqual(const T a, const T b, const unsigned ulps=4)
{
    // Handle NaN.
    if (std::isnan(a) || std::isnan(b))
        return false;

    typedef std::numeric_limits<T> limits;

    // Handle very small and exactly equal values.
    if (std::abs(a-b) <= ulps * limits::denorm_min())
        return true;

    // frexp() does the wrong thing for zero.  But if we get this far
    // and either number is zero, then the other is too big, so just
    // handle that now.
    if (a == 0 || b == 0)
        return false;

    // Break the numbers into significand and exponent, sorting them by
    // exponent.
    int min_exp, max_exp;
    T min_frac = my_frexp(a, &min_exp);
    T max_frac = my_frexp(b, &max_exp);
    if (min_exp > max_exp)
    {
        std::swap(min_frac, max_frac);
        std::swap(min_exp, max_exp);
    }

    // Convert the smaller to the scale of the larger by adjusting its
    // significand.
    const T scaled_min_frac = std::ldexp(min_frac, min_exp-max_exp);

    // Since the significands are now in the same scale, and the larger
    // is in the range [0.5, 1), 1 ulp is just epsilon/2.
    return std::abs(max_frac-scaled_min_frac) <= ulps * limits::epsilon() / 2;
}
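For concreteness, here is a quick sanity check of the behavior I intend (my own examples; note that 0.1 + 0.2 and 0.3 are exactly 1 ULP apart in IEEE 754 double):

#include <cassert>
#include <iostream>
#include <limits>

int main()
{
    typedef std::numeric_limits<double> limits;
    assert(almostEqual(0.1 + 0.2, 0.3));                     // 1 ULP apart: almost equal
    assert(!almostEqual(1.0, 1.000001));                     // billions of ULP apart: not equal
    assert(!almostEqual(limits::quiet_NaN(), 0.0));          // NaN compares almost-equal to nothing
    assert(almostEqual(limits::max(), limits::infinity()));  // max() is 1 ULP below infinity
    std::cout << "all checks passed\n";
}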

I claim that this code (a) handles all of the relevant cases, (b) does the same thing as the Google implementation for IEEE-754 single- and double-precision, and (c) is perfectly standard C++.

One or more of these claims is almost certainly wrong. I will accept any answer that demonstrates such, preferably with a fix. A good answer should include one or more of:

  • Specific inputs differing by more than ulps Units in Last Place, but for which this function returns true (the bigger the difference, the better)
  • Specific inputs differing by up to ulps Units in Last Place, but for which this function returns false (the smaller the difference, the better)
  • Any case(s) I have missed
  • Any way in which this code relies on undefined behavior or breaks depending on implementation-defined behavior. (If possible, please cite a relevant spec.)
  • Fixes for whatever problem(s) you identify
  • Any way to simplify the code without breaking it

I intend to place a non-trivial bounty on this question.

Nemo
  • http://floating-point-gui.de/ – Pavel Radzivilovsky Dec 18 '12 at 19:53
  • @Eric: You are right; I am using 2^ulps when I should just use ulps. Fixing now. As for the default, the Google folks disagree with you on whether a default is a good idea... It likely depends on your particular application. I will think it over for mine. Thanks for the comments. – Nemo Dec 18 '12 at 20:05
  • The expression `factor * limits::epsilon / 2` converts `factor` to the floating-point type, which causes rounding errors for large values of `factor` that are not exactly representable. Likely, the routine is not intended to be used with such large values (millions of ULPs in `float`), so this ought to be specified as a limit on the routine rather than a bug. (Of course, the shift `1U << ulps` is also subject to error when `ulps` is large, but this is being removed.) – Eric Postpischil Dec 18 '12 at 20:35
  • As an example of why a default of 4 ULP is bad, consider `1./49*49-1`. The mathematically exact result is 0, but the computed result (64-bit IEEE 754 binary) is -0x1p-53, an error exceeding 1e307 ULP of the exact result and almost 1e16 ULP of the computed result. – Eric Postpischil Dec 18 '12 at 21:19
  • @Eric: So [the POSIX spec for frexp](http://pubs.opengroup.org/onlinepubs/9699919799/functions/frexp.html) is also using "mantissa" incorrectly? And the [Wikipedia definition](http://en.wikipedia.org/wiki/Mantissa) is also wrong? Thanks again for the good comments. – Nemo Dec 18 '12 at 21:44
  • @Nemo: “Mantissa” does not appear in the IEEE 754 standard. The Wikipedia page you point to is not a definition; it is a disambiguation page. The page it refers to, about “significand”, has a [usage note](http://en.wikipedia.org/wiki/Significand#Use_of_.22mantissa.22) on “mantissa”. The POSIX standard specifies POSIX; it is not authoritative on floating point or mathematics. – Eric Postpischil Dec 18 '12 at 21:48
  • @Eric: The use of "mantissa" is so common in Computer Science that I would argue it is acceptable... But since you are being so helpful -- and, truth be told, since I have an affinity for pedantry myself -- I have changed the comments and renamed the variables. (I do not like any of the obvious abbreviations for "significand", so I went with "_frac". Hope that is acceptable.) – Nemo Dec 18 '12 at 21:58
  • Interesting, but is this a real question? – hookenz Dec 18 '12 at 22:17
  • Are you sure that you leave the fp status flags as google code would? – aka.nice Dec 18 '12 at 23:22
  • @Matt - If you prefer, pretend I titled it "what is wrong with this code"? It may not be phrased as a question, but it is definitely a question, and I think the current title will attract the right kind of people. – Nemo Dec 18 '12 at 23:43
  • Re: "Floating point calculations are not exact": this is false. Floating point calculations **are** exact. The typical beginner's problem is the assumption that **conversions** from decimal (typically text) to binary are exact. – Pete Becker Dec 19 '12 at 14:14
  • @PeteBecker floating point calculations are only interesting inasmuch as they represent calculations with mathematical real numbers, and this representation is not exact. Decimal to binary conversions are but a particular case of this phenomenon. – n. m. could be an AI Nov 11 '13 at 17:02
  • @n.m. - floating-point math does **not** represent calculations with real numbers, and no amount of hacking will change that. Either you accept the way that floating-point math works on its own merits, or you spend inordinate amounts of time trying to work around a flawed mental model. – Pete Becker Nov 11 '13 at 17:41
  • @PeteBecker: what do you mean by "do not represent"? There's a very well defined mapping from real numbers to floating point numbers. That's the "representation". You are free to ignore it if it's not useful to you, but I find it rather convenient. – n. m. could be an AI Nov 11 '13 at 18:17
  • @n.m. - maybe I misunderstood what you meant by "represent **calculations** with mathematical real numbers" [emphasis added]. There's a rough similarity, but calculations with floating-point numbers do **not** obey the same rules as real numbers. – Pete Becker Nov 11 '13 at 19:30
  • "calculations with floating-point numbers do not obey the same rules as real numbers" -- sure they don't, that's why they are called inexact... – n. m. could be an AI Nov 11 '13 at 19:44

4 Answers


“Almost Equals” Is Not a Good Function

4 is not an appropriate value: The answer you point to states “Therefore, 4 should be enough for ordinary use” but contains no basis for that claim. In fact, there are ordinary situations in which numbers calculated in floating-point by different means may differ by many ULP even though they would be equal if calculated by exact mathematics. Therefore, there should be no default value for the tolerance; each user should be required to supply their own, hopefully based on thorough analysis of their code.

As an example of why a default of 4 ULP is bad, consider 1./49*49-1. The mathematically exact result is 0, but the computed result (64-bit IEEE 754 binary) is -2^-53, an error exceeding 10^307 ULP of the exact result and almost 10^16 ULP of the computed result.
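Transcribed into code against the question’s almostEqual (my framing of the same example):

#include <iostream>

int main()
{
    const double computed = 1. / 49 * 49 - 1;        // mathematically exactly 0
    std::cout << computed << "\n";                   // prints -1.11022e-16, i.e. -0x1p-53
    std::cout << almostEqual(computed, 0.) << "\n";  // prints 0: not within 4 ULP of 0
}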

Sometimes, no value is appropriate: In some cases, the tolerance cannot be relative to the values being compared, neither a mathematically exact relative tolerance nor a quantized ULP tolerance. For example, nearly every output value in an FFT is affected by nearly every input value, and the error in any one element is related to the magnitudes of other elements. So the amount of error we might get cannot be computed from the two values being compared; it is a function of other data in the FFT. An “almost equals” routine must be supplied additional context with information about the potential error.

“Almost Equals” has poor mathematical properties: Here is a demonstration of one of its shortcomings: scaling changes the results. The code below prints 1 and 0, indicating that 1.1 and 1.1 + 3*0x1p-52 are “almost equal,” but the same values multiplied by .8 are not “almost equal.”

#include <iostream>

int main()  // assumes almostEqual from the question is in scope
{
    double x0 = 1.1;
    double x1 = 1.1 + 3*0x1p-52;
    std::cout << almostEqual(x0, x1) << "\n";
    x0 *= .8;
    x1 *= .8;
    std::cout << almostEqual(x0, x1) << "\n";
}

Another failing is that it is not transitive: almostEqual(a, b) and almostEqual(b, c) do not imply almostEqual(a, c).
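For instance (my own counterexample, using the default tolerance of 4 ULP on IEEE 754 double):

double a = 1.0;
double b = 1.0 + 4 * 0x1p-52;  // exactly 4 ULP above a
double c = 1.0 + 8 * 0x1p-52;  // exactly 4 ULP above b, 8 ULP above a
// almostEqual(a, b) and almostEqual(b, c) are both true, yet almostEqual(a, c) is false.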

A Bug in Extreme Cases

almostEqual(1.f, 1.f/11, 0x745d17) incorrectly returns 1.

1.f/11 is 0x1.745d18p-4 (hexadecimal floating-point notation, meaning 1.745d18₁₆ • 2^-4). Subtracting this from 1 (which is 0x10p-4) yields 0xe.8ba2e8p-4. Since an ULP of 1 is 0x1p-23, that is 0xe.8ba2e8p19 ULP = 0xe8ba2e.8/2 ULP (shifted 20 bits and divided by 2, netting 19 bits) = 0x745d17.4 ULP. That exceeds the specified tolerance of 0x745d17, so the correct answer would be 0.

This error is caused by rounding in max_frac-scaled_min_frac.

An easy escape from this problem is to specify that ulps must be less than .5/limits::epsilon. Then rounding occurs in max_frac-scaled_min_frac only if the difference (even when rounded) exceeds ulps; if the difference is less than that, the subtraction is exact, by Sterbenz’ Lemma.
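One way to encode that restriction (my sketch, not part of the question’s code):

#include <cassert>

// At the top of almostEqual(), alongside the NaN check: reject tolerances
// so large that rounding in max_frac - scaled_min_frac could matter.
assert(ulps < T(0.5) / std::numeric_limits<T>::epsilon());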

There was a suggestion about using long double to correct this. However, long double would not correct this. Consider comparing 1 and -0x1p-149f with ulps set to 1/limits::epsilon. Unless your significand has 149 bits, the subtraction result rounds to 1, which is less than or equal to 1/limits::epsilon ULP. Yet the mathematical difference clearly exceeds 1.

Minor Note

The expression ulps * limits::epsilon() / 2 converts ulps to the floating-point type, which causes rounding errors for large values of ulps that are not exactly representable. Likely, the routine is not intended to be used with such large values (millions of ULPs in float), so this ought to be specified as a limit on the routine rather than a bug.

Eric Postpischil
  • Thank you for taking a close look at this, and nice catch (+1). If this is the only bug, I think I am in pretty good shape. I believe I could also fix this by using `long double` when computing `max_mant-scaled_min_mant`... But restricting the legal range for `ulps` is probably a better idea. – Nemo Dec 18 '12 at 21:47
  • I sent you Email but maybe it got filtered... Could you please delete your comments on the question and move them into your answer? They will be much easier to read that way, especially once I make your response the accepted one. Thanks. – Nemo Dec 27 '12 at 17:04

Simplification: You could avoid my_frexp by discarding the non-finite cases first altogether:

if( ! std::isfinite(a) || ! std::isfinite(b) )
    return a == b;

It seems that std::isfinite is available in C++11, at least.

EDIT: However, if the intention is to have limits::infinity() within 1 ulp of limits::max(), then the above simplification does not hold. But shouldn't my_frexp() return limits::max_exponent+1 in *exp, rather than max_exponent+2?

aka.nice
  • // non finite floating point are either equal or not, but never almost – aka.nice Dec 18 '12 at 23:07
  • The questioner says, "for example, a very large number compares "almost equal" to infinity" with tones of apparent approval. – Steve Jessop Dec 18 '12 at 23:33
  • @SteveJessop ah good point, but then my_frexp should return max_exponent + 1 in *exp, not + 2; that misguided me :( – aka.nice Dec 18 '12 at 23:40
  • @Steve - Indeed, the spirit of this question is to mimic the (apparently popular) Google approach, but portably. I would give up exact mimicry for a major improvement in portability, speed, or readability... But if I can have my cake and eat it too, I will. @aka.nice - `max_exponent + 2` is correct because `frexp` returns a fractional part in the range [0.5, 1), not [1, 2) – Nemo Dec 19 '12 at 00:26
  • @Nemo I understand that, but 2^max_exponent is yet greater than max(), so why 2^(max_exponent+1)??? It's more than twice max() – aka.nice Dec 19 '12 at 00:29
  • @aka.nice: Good point (+1). I had the wrong definition of `max_exponent` in my head. Fixed; thank you. – Nemo Dec 19 '12 at 00:36

FUTURE PROOFING: If you ever want to extend such a comparison to decimal floats (http://en.wikipedia.org/wiki/Decimal64_floating-point_format) in the future, and assuming that ldexp() and frexp() will handle such a type with the correct radix, then strictly speaking the 0.5 in return std::copysign(0.5, num); should be replaced by T(1)/limits::radix() -- or std::ldexp(T(1),-1), or something... (I could not find a convenient constant in std::numeric_limits.)

EDIT: As Nemo commented, the assumption that ldexp and frexp would use the correct FLOAT_RADIX is false; they stick with 2...

So a future-proof portable version should also use:

  • std::scalbn(x,n) instead of std::ldexp(x,n)

  • exp = std::ilogb(std::abs(x)); y = std::scalbn(x, -exp) instead of y = frexp(x, &exp)

  • now that the above y is in [1, FLOAT_RADIX) instead of [T(1)/FLOAT_RADIX, 1): return copysign(T(1), num) instead of 0.5 for the infinite case of my_frexp, and test against ulps*limits::epsilon() instead of ulps*limits::epsilon()/2

That also requires C++11 or later. A sketch putting these pieces together follows.
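Here is such a sketch (untested; my_logb and almostEqualRadix are illustrative names of mine, and the zero/NaN handling simply mirrors the question's code):

#include <cmath>
#include <limits>
#include <algorithm>

namespace {
// Radix-aware counterpart of my_frexp(): decomposes num as frac * radix^exp
// with |frac| in [1, radix), treating +-infinity as +-(radix^max_exponent),
// one step beyond max().  Zero must be handled by the caller.
template<typename T>
T my_logb(const T num, int *exp)
{
    typedef std::numeric_limits<T> limits;

    if (std::abs(num) > limits::max())
    {
        *exp = limits::max_exponent;
        return std::copysign(T(1), num);
    }
    *exp = std::ilogb(num);
    return std::scalbn(num, -*exp);
}
}

template<typename T>
bool almostEqualRadix(const T a, const T b, const unsigned ulps=4)
{
    typedef std::numeric_limits<T> limits;

    if (std::isnan(a) || std::isnan(b))
        return false;

    if (std::abs(a-b) <= ulps * limits::denorm_min())
        return true;

    if (a == 0 || b == 0)
        return false;

    int min_exp, max_exp;
    T min_frac = my_logb(a, &min_exp);
    T max_frac = my_logb(b, &max_exp);
    if (min_exp > max_exp)
    {
        std::swap(min_frac, max_frac);
        std::swap(min_exp, max_exp);
    }

    const T scaled_min_frac = std::scalbn(min_frac, min_exp-max_exp);

    // Significands are now in [1, radix), so 1 ulp is epsilon -- not
    // epsilon/2 as in the frexp()-based version.
    return std::abs(max_frac-scaled_min_frac) <= ulps * limits::epsilon();
}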

aka.nice
  • Good point. But for real future-proofing, I would need to use the C++11 `std::ilogb` and `std::scalb` to do everything in the right radix. Where I work, C++11 is not an option and probably won't be this decade... Still, a perfectly portable C++11 implementation seems possible and worth having around. Want to take a crack at adding it to this answer? – Nemo Dec 19 '12 at 01:06

Two disadvantages of your approach are that your implementation relies on several functions from <cmath> which clearly work only for the floating-point types found in the standard and which, sadly, are currently not guaranteed to be constexpr (while a constexpr version of your function will compile with GCC, it won't with most other compilers).

The advantage of the Google approach over yours is that in C++20 you can replace the union (or a reinterpret_cast) in the GoogleTest version with std::bit_cast, resulting in a constexpr function. Furthermore, by checking std::numeric_limits<T>::is_iec559 as well as using the newly introduced std::endian, you should be able to make sure that the assumed internal representation is correct. By introducing two custom traits you can gain a lot of flexibility, which allows you to extend the function to non-standard floating types:

  • One trait for determining the unsigned integer type of equal length (in my code termed UIntEquiv): this way you can extend the code to arbitrary-length floating-point numbers by adding a specialisation (e.g. template<> class UIntEquiv<16> using GCC's __uint128_t or Boost Multiprecision).
  • Another for the actual floating type (in my code termed FloatTrait) with its sign, mantissa and exponent masks, as well as the representations for infinity and NaN: this way you can extend the comparison to non-C++-standard floating-point numbers like a float256 by adding yet another specialisation.

I inserted the code from my Gist below:

#include <bit>
#include <concepts>
#include <cstdint>
#include <iomanip>
#include <iostream>
#include <limits>
#include <type_traits>
#include <utility>
#include <vector>


namespace detail {
  // Trait for excluding incomplete types: See https://stackoverflow.com/a/44229779/9938686
  template <typename T, std::size_t = sizeof(T)>
  consteval std::true_type is_complete(T*) noexcept;

  consteval std::false_type is_complete(...) noexcept;
}

template <typename T>
using is_complete = decltype(detail::is_complete(std::declval<T*>()));

template <typename T>
static constexpr bool is_complete_v = is_complete<T>::value;

namespace detail {

  // Class for determining the corresponding unsigned integer type with equal length to the floating type
  template <std::size_t N>
  class UIntEquiv {
    protected:
      UIntEquiv() = delete;
      UIntEquiv(UIntEquiv const&) = delete;
      UIntEquiv(UIntEquiv&&) = delete;
      UIntEquiv& operator= (UIntEquiv const&) = delete;
      UIntEquiv& operator= (UIntEquiv&&) = delete;

      template<std::size_t M, typename std::enable_if_t<(M==sizeof(std::uint8_t))>* = nullptr>
      static consteval std::uint8_t determineUIntType() noexcept;
      template<std::size_t M, typename std::enable_if_t<(M==sizeof(std::uint16_t))>* = nullptr>
      static consteval std::uint16_t determineUIntType() noexcept;
      template<std::size_t M, typename std::enable_if_t<(M==sizeof(std::uint32_t))>* = nullptr>
      static consteval std::uint32_t determineUIntType() noexcept;
      template<std::size_t M, typename std::enable_if_t<(M==sizeof(std::uint64_t))>* = nullptr>
      static consteval std::uint64_t determineUIntType() noexcept;

    public:
      using type = decltype(determineUIntType<N>());
  };
  
  // You can potentially add specialisation of UIntEquiv for longer unsigned integer types here (e.g. for long double support).
  // e.g. GCC's __uint128_t: https://gcc.gnu.org/onlinedocs/gcc/_005f_005fint128.html
  //      or Boost: https://www.boost.org/doc/libs/1_67_0/libs/multiprecision/doc/html/boost_multiprecision/tut/ints/cpp_int.html
  //
  // template <>
  // class UIntEquiv<sizeof(__uint128_t)> {
  //   public:
  //     using type = __uint128_t;
  // };
  //
  // As long as std::numeric_limits<T> is specialized for the corresponding floating type and your architecture respects IEEE754 and stores
  // your floating point numbers with little-endian the code should compile correctly.
  // Therefore in case you have particular proprietary floating types with a different mantissa and exponent such as 
  // e.g. GCC's __float128: https://gcc.gnu.org/onlinedocs/gcc/Floating-Types.html
  // the fastest solution is probably to specialize the std::numeric_limits<T> trait yourself.
  // Boost should already provide the fully specialized traits: https://www.boost.org/doc/libs/1_65_1/libs/multiprecision/doc/html/boost_multiprecision/tut/floats/float128.html

  template <std::size_t N>
  using UIntEquiv_t = typename UIntEquiv<N>::type;

  // In case your floating type does not respect IEEE754 or is not stored with little endian you will have to specialise the entire 
  // FloatTrait yourself:
  template <typename T>
  class FloatTrait;

  // Specialised trait for floating point number types according to IEEE754 stored with little endian
  template <typename T>
  requires std::is_floating_point_v<T> && std::numeric_limits<T>::is_iec559 && (std::endian::native == std::endian::little)
  class FloatTrait<T> {
    public:
      static constexpr std::size_t number_of_bytes {sizeof(T)};
      static constexpr std::size_t number_of_bits {number_of_bytes*std::numeric_limits<std::uint8_t>::digits};
      using Bytes = UIntEquiv_t<number_of_bytes>;

      static constexpr std::size_t number_of_sign_bits {1};
      static constexpr std::size_t number_of_fraction_bits {std::numeric_limits<T>::digits-1};
      static constexpr std::size_t number_of_exponent_bits {number_of_bits - number_of_sign_bits - number_of_fraction_bits};

      static constexpr Bytes sign_mask {Bytes{1} << (number_of_bits - 1)};
      static constexpr Bytes fraction_mask {~Bytes{0} >> (number_of_exponent_bits + 1)};
      static constexpr Bytes exponent_mask {~(sign_mask | fraction_mask)};

      static constexpr bool isNan(T const t) noexcept {
        auto const bytes {std::bit_cast<Bytes>(t)};
        auto const exponent_bytes {extractExponent(bytes)};
        auto const fraction_bytes {extractFraction(bytes)};
        return (exponent_bytes == exponent_mask) && (fraction_bytes != 0);
      }

      static constexpr bool isPosInf(T const t) noexcept {
        return isPos(t) && isInf(t);
      }

      static constexpr bool isNegInf(T const t) noexcept {
        return isNeg(t) && isInf(t);
      }

      static constexpr bool isNeg(T const t) noexcept {
        auto const bytes {std::bit_cast<Bytes>(t)};
        auto const sign_bytes {extractSign(bytes)};
        return sign_bytes != 0;
      }

      // Optional helper functions
      static constexpr bool isPos(T const t) noexcept {
        auto const bytes {std::bit_cast<Bytes>(t)};
        auto const sign_bytes {extractSign(bytes)};
        return sign_bytes == 0;
      }

      static constexpr bool isInf(T const t) noexcept {
        auto const bytes {std::bit_cast<Bytes>(t)};
        auto const exponent_bytes {extractExponent(bytes)};
        auto const fraction_bytes {extractFraction(bytes)};
        return (exponent_bytes == exponent_mask) && (fraction_bytes == 0);
      }

      static constexpr Bytes extractSign(Bytes const bytes) noexcept {
        return bytes & sign_mask;
      }

      static constexpr Bytes extractExponent(Bytes const bytes) noexcept {
        return bytes & exponent_mask;
      }

      static constexpr Bytes extractFraction(Bytes const bytes) noexcept {
        return bytes & fraction_mask;
      }

    protected:
      FloatTrait() = delete;
      FloatTrait(FloatTrait const&) = delete;
      FloatTrait(FloatTrait&&) = delete;
      FloatTrait& operator= (FloatTrait const&) = delete;
      FloatTrait& operator= (FloatTrait&&) = delete;
  };

  template <typename T>
  requires is_complete_v<FloatTrait<T>>
  class FloatView {
    public:
      using Trait = FloatTrait<T>;
      using Bytes = typename FloatTrait<T>::Bytes;

      explicit constexpr FloatView(T const t) noexcept
        : value{t} {
        return;
      }
      FloatView() = default;
      FloatView(FloatView const&) = default;
      FloatView(FloatView&&) = default;
      FloatView& operator= (FloatView const&) = default;
      FloatView& operator= (FloatView&&) = default;

      constexpr bool isAlmostEqual(FloatView const rhs, std::uint8_t const max_distance = 4) const noexcept {
        if (Trait::isNan(value) || Trait::isNan(rhs.value)) {
          return false;
        } else if (Trait::isNegInf(value) != Trait::isNegInf(rhs.value)) {
          return false;
        } else if (Trait::isPosInf(value) != Trait::isPosInf(rhs.value)) {
          return false;
        }
        return computeDistance(value, rhs.value) <= max_distance;
      }

    protected:
      T value;

      static constexpr Bytes signMagnitudeToBiased(T const t) noexcept {
        auto const b {std::bit_cast<Bytes>(t)};
        if (Trait::isNeg(t)) {
          return ~b + Bytes{1};
        } else {
          return Trait::sign_mask | b;
        }
      }

      static constexpr Bytes computeDistance(T const a, T const b) noexcept {
        auto const biased1 = signMagnitudeToBiased(a);
        auto const biased2 = signMagnitudeToBiased(b);
        return (biased1 >= biased2) ? (biased1 - biased2) : (biased2 - biased1);
      }
  };

}

template <typename T>
constexpr bool isAlmostEqual(T const lhs, T const rhs, std::uint8_t const max_distance = 4) noexcept {
  detail::FloatView<T> const a {lhs};
  detail::FloatView<T> const b {rhs};
  return a.isAlmostEqual(b, max_distance);
}
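A quick usage sketch (my addition; it assumes a C++20 compiler with std::bit_cast on a little-endian IEEE 754 platform, as the traits above require):

#include <iostream>

int main() {
    std::cout << std::boolalpha;
    std::cout << isAlmostEqual(0.1 + 0.2, 0.3) << "\n";  // true: 1 ULP apart
    std::cout << isAlmostEqual(1.0f, 1.0001f) << "\n";   // false: hundreds of ULP apart
    static_assert(isAlmostEqual(1.0, 1.0), "usable at compile time");
    return 0;
}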


2b-t