
I want to judge whether two floating-point numbers are equal. The usual approach is supposed to be fabs(a - b) < DBL_EPSILON, but when a = pow(2, -100) and b = pow(2, -101), the comparison yields true, because (a - b) is about 3.9443e-31, which is less than DBL_EPSILON = 2.22045e-16. In fact these two numbers are not equal; a is twice as much as b. If I compare them against DBL_MIN instead, they are indeed reported as not equal. Should I do that?
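
For reference, a minimal program reproducing this (a hypothetical test, assuming IEEE 754 double):

#include <cfloat>
#include <cmath>
#include <iostream>

int main()
{
    double a = std::pow(2, -100);  // about 7.89e-31
    double b = std::pow(2, -101);  // about 3.94e-31, exactly half of a
    // The absolute comparison against DBL_EPSILON reports "equal".
    std::cout << (std::fabs(a - b) < DBL_EPSILON) << '\n';  // prints 1
}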

The definitions from the wiki:
DBL_MIN – minimum normalized positive value of double;
DBL_EPSILON – difference between 1.0 and the next representable value of double.
According to the definitions above, DBL_EPSILON is the minimum precision of a double value, so why is there DBL_MIN? What is the relationship between DBL_MIN and DBL_EPSILON?

Alan Wang

2 Answers


Other than both being influenced by the number of bits used for the floating-point type, there is no relationship between the two.

DBL_MIN is effectively governed by the number of bits used in the exponent. IEEE 754 double format uses 11 bits for the exponent, which gives ~2^-1022 ~= 2.2e-308 for the minimum normal value. This is exceedingly small.

DBL_EPSILON is effectively governed by the number of bits in the mantissa. IEEE 754 double uses 52 bits, giving an epsilon of 2^-52 ~= 2.2e-16.
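
A quick way to check both claims (a sketch, assuming IEEE 754 binary64 doubles):

#include <cfloat>
#include <cmath>
#include <iostream>

int main()
{
    // Both constants are powers of two fixed by the exponent and mantissa widths.
    std::cout << (DBL_MIN     == std::ldexp(1.0, -1022)) << '\n';  // prints 1: 2^-1022
    std::cout << (DBL_EPSILON == std::ldexp(1.0, -52))   << '\n';  // prints 1: 2^-52
}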

Important: DBL_MIN is an absolute value. DBL_EPSILON is a relative value.

Specifically, DBL_EPSILON is relative to 1.0. As you noted, if you use DBL_EPSILON as an absolute tolerance then numbers much smaller than the epsilon will be considered equal. This might be what you want, since you can think of such small numbers as being zero plus noise; indeed this is the main use for absolute tolerances. It also works the other way: with large numbers, a small absolute tolerance will never consider any differences, even relatively tiny ones, to be within tolerance.
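
To illustrate both effects (a sketch, again assuming IEEE 754 doubles):

#include <cfloat>
#include <cmath>
#include <iostream>

int main()
{
    // Tiny numbers: an absolute tolerance of DBL_EPSILON calls 2^-100 and 2^-101 "equal".
    std::cout << (std::fabs(std::pow(2, -100) - std::pow(2, -101)) < DBL_EPSILON) << '\n';  // 1
    // Large numbers: even two adjacent representable doubles differ by far more than DBL_EPSILON.
    double x = 1e20;
    double y = std::nextafter(x, 2e20);  // the next representable double above 1e20
    std::cout << (std::fabs(x - y) < DBL_EPSILON) << '\n';  // 0 (the gap here is 16384)
}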

Your functions for comparison with a tolerance will depend on several factors:

  • the range of values in your domain (e.g., astrophysics doesn't use the same scales as microelectronics)
  • your accuracy needs
  • the amount of numerical error in your calculations

Finally, you may need comparison functions with a relative tolerance, an absolute tolerance, or both.

Here is an example (in C++, untested).

#include <algorithm>  // std::max
#include <cmath>      // std::fabs

// True if a and b are within a relative tolerance of each other,
// or within a small absolute tolerance.
bool almostEqual(double a, double b, double relTol = 1e-3, double absTolVal = 1e-8)
{
    double maxVal = std::max(std::fabs(a), std::fabs(b));  // larger magnitude of the two
    double relTolVal = maxVal * relTol;                    // tolerance scaled to the operands
    double diffVal = std::fabs(a - b);
    return diffVal <= relTolVal || diffVal <= absTolVal;
}

A few comments:

  • If you don't use max or min then there will be cases where almostEqual(a, b) and almostEqual(b, a) give different results.
  • Whilst you can templatize such a function, this makes choosing the default values of the tolerances more difficult.
  • In my experience (analog circuit simulation), a value between 1e-2 and 1e-4 is good for the relTol. For the absTolVal, a value that is a factor of 1e6 to 1e8 smaller than the largest value in your domain is a good guide. For instance in circuit simulation, typical voltages are around 1V so a value of 1e-6 would be a suitable voltage absolute tolerance. Currents are of a much smaller magnitude, in the region of microamps, so the current absolute tolerance should be something like 1e-12.
  • The code above has a potential overflow if a and b have opposite signs. This can be handled by adding a sign check.
  • NaN may need special treatment.
  • Inf also needs special treatment. One possible variant handling these cases is sketched after this list.
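
Here is one possible more defensive variant (untested, a sketch rather than a definitive implementation; instead of an explicit sign check it simply tests whether the subtraction overflowed):

#include <algorithm>
#include <cmath>

bool almostEqualRobust(double a, double b, double relTol = 1e-3, double absTolVal = 1e-8)
{
    if (std::isnan(a) || std::isnan(b))
        return false;                   // NaN is equal to nothing, including itself
    if (std::isinf(a) || std::isinf(b))
        return a == b;                  // infinities match only the same infinity
    double diffVal = std::fabs(a - b);  // can overflow to +inf for huge opposite-sign inputs
    if (std::isinf(diffVal))
        return false;                   // such inputs are certainly not almost equal
    double maxVal = std::max(std::fabs(a), std::fabs(b));
    return diffVal <= maxVal * relTol || diffVal <= absTolVal;
}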

Further Reading: There are many articles on this subject. There is the often-quoted What Every Computer Scientist Should Know About Floating-Point Arithmetic.
Next, a simpler version, What Every Programmer Should Know About Floating-Point Arithmetic.
Third and last, a good discussion with examples.

Paul Floyd

I want to judge if two floating-point numbers are equal.

No, you do not. What you are actually trying to do is test whether two real numbers are equal when all you have is two floating-point numbers a and b: a and b are the results of floating-point operations, whereas the real numbers you actually care about are the results the same calculations would have produced in exact real-number arithmetic.

Two floating-point objects compare equal if and only if they represent equal numbers. So, if you were trying to judge whether two floating-point numbers were equal, all that would be necessary is to evaluate a == b. That evaluates to true if and only if a and b are equal. So “comparing floating-point numbers” is easy. But you want to “compare the two real numbers I would have if I were using real-number arithmetic, but I only have floating-point numbers,” and that is not easy.

The normal operation should be fabs (a - b) < DBL_EPSILON,…

No, that is not the normal operation. There is no general solution for comparing floating-point numbers that contain errors from previous operations. I have written about this previously here, here, and here.

The definitions from the wiki:

DBL_MIN – minimum normalized positive value of double;

DBL_EPSILON – difference between 1.0 and the next representable value of double.

According to the definitions above, DBL_EPSILON is the minimum precision of a double value, so why is there DBL_MIN? What is the relationship between DBL_MIN and DBL_EPSILON?

Your question does not state what programming language or implementation you are using, so we do not know precisely what is used for the double type you are using. However, IEEE-754 64-bit binary floating-point is ubiquitous. In this format, numbers are represented with a sign, a 53-bit significand, and an exponent of two from −1022 to +1023. (The significand is encoded using both a 52-bit field and some information from the exponent field, so many people refer to it as a 52-bit significand, but this is incorrect. Only the primary field for encoding it is 52 bits. The actual significand is 53 bits.) This information about the significand width and exponent range is enough to understand DBL_MIN and DBL_EPSILON, so I will not discuss the encoding format much in this answer. However, I will point out there are normal significands and subnormal significands. For normal significands, the significand value is given by the binary numeral “1.” followed by 52 bits after the radix point (the 52 bits in the significand field). For subnormal significands, the significand value is given by “0.” followed by 52 bits. Normal and subnormal significands are distinguished by the value in the exponent field.

DBL_MIN is the minimum normal positive value. So it has the smallest normal significand value, given by “1.0000000000000000000000000000000000000000000000000000”, which is 1, and the lowest exponent, −1022. So it is +1•2^−1022, which is about 2.2•10^−308.

DBL_EPSILON is the difference between one and the next value representable in the floating-point format. That next value is given by a significand with binary “1.0000000000000000000000000000000000000000000000000001”, which is 1+2^−52. So DBL_EPSILON is 2^−52.
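
For instance, these definitions (and the normal/subnormal distinction) can be checked directly in C++, assuming IEEE-754 binary64 and that subnormals are not flushed to zero:

#include <cfloat>
#include <cmath>
#include <iostream>

int main()
{
    // The next representable value after 1.0 differs from it by exactly DBL_EPSILON.
    std::cout << (std::nextafter(1.0, 2.0) - 1.0 == DBL_EPSILON) << '\n';  // 1
    // DBL_MIN is the smallest normal value; halving it gives a subnormal.
    std::cout << (std::fpclassify(DBL_MIN)     == FP_NORMAL)    << '\n';   // 1
    std::cout << (std::fpclassify(DBL_MIN / 2) == FP_SUBNORMAL) << '\n';   // 1
    std::cout << DBL_MIN << ' ' << DBL_EPSILON << '\n';  // about 2.22507e-308 and 2.22045e-16
}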

Which of these should you use for a tolerance in comparison? Neither. To get a and b, presumably you did some floating-point operation. In each of those operations, there may have been some error. Floating-point arithmetic approximates real arithmetic. For each elementary operation, floating-point arithmetic gives you the representable value that is nearest the real-number result. (Usually, this is the nearest in either direction, but directed rounding modes may be available to choose a preferred direction.) When this representable result differs from the real-number result, the difference is called rounding error. In round-to-nearest mode, the rounding error may, in general, be up to 1/2 the distance between representable numbers in that vicinity.

When you do more than one floating-point operation, these rounding errors compound. They may accumulate or happen to cancel. Each error is small relative to the immediate result, but, as that number is used in further calculations, the final result of the calculations may be small, so errors that occurred during the calculations may be large compared to the final result.
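
A classic small illustration of compounded rounding error:

#include <iomanip>
#include <iostream>

int main()
{
    double sum = 0.0;
    for (int i = 0; i < 10; ++i)
        sum += 0.1;                    // 0.1 is not exactly representable in binary
    std::cout << std::setprecision(17) << sum << '\n';  // slightly below 1.0 on typical systems
    std::cout << (sum == 1.0) << '\n';                  // 0
}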

Understanding what the final error may be is a difficult problem in general. There is an entire field of study for it, numerical analysis. What this means is there cannot be any general recommendation about what tolerance to use when attempting to compare floating-point numbers the way you want. It requires study particular to each problem. Furthermore, even if you figure out that the floating-point results a and b might be some distance d apart when the exact real-number results would be equal, that does not mean comparing a and b with a tolerance of d is the right thing to do. That would ensure you get no false negatives: every time the exact results are equal, your comparison of a and b returns true. However, it would allow false positives: sometimes when the exact results are not equal, your comparison of a and b still returns true.

This is another reason there can be no general advice for comparing floating-point numbers. The first is that the errors that occur are particular to each computation. The second is that eliminating false negatives requires allowing false positives, and whether that is acceptable or not depends on the application. So it cannot be given as general advice.

Eric Postpischil