Floating-point arithemtic in C: epsilon comparison

Question

I'm trying to compare values with double precision using epsilon. However, I have a problem - initially I have thought that the difference should be equal to the epsilon, but it isn't. Additionally, when I've tried to check the binary representation using the successive multiplication something strange has happened and I feel confused, therefore I would appreciate your explanation to the problem and comments on my way of thinking

#include <stdio.h>

#define EPSILON 1e-10

void double_equal(double a, double b) {
    printf("a: %.12f, b: %.12f, a - b = %.12f\n", a, b, a - b);
    printf("a: %.12f, b: %.12f, b - a = %.12f\n", a, b, b - a);
    if (a - b < EPSILON) printf("a - b < EPSILON\n");
    if (a - b == EPSILON) printf("a - b == EPSILON\n");
    if (a - b <= EPSILON) printf("a - b <= EPSILON\n");
    if (b - a <= EPSILON) printf("b - a <= EPSILON\n");
}

int main(void) {
    double wit1 = 1.0000000001;
    double wit2 = 1.0;
    double_equal(wit1, wit2);
    return 0;
}

The output is:

a: 1.000000000100, b: 1.000000000000, a - b = 0.000000000100
a: 1.000000000100, b: 1.000000000000, b - a = -0.000000000100
b - a <= EPSILON

Numeric constants in C are declared as doubles if we don't provide "F"/"f" sign right after the number (#define EPSILON 1e-10F), therefore I can't see here the problem of conversion as in this question. Therefore, I have created really simple program for THESE SPECIFIC examples (I know it should include handling converting integral parts to binary numbers).

#include <stdio.h>
#include <math.h>
#include <stdlib.h>

char* convert(double a) {
    char* res = malloc(200);
    int count = 0;
    double integral;
    a = modf(a, &integral);

    if (integral == 1) {
        res[count++] = integral + '0';
        res[count++] = '.';
    } else {
        res[count++] = '0';
        res[count++] = '.';
    }

    while(a != 0 && count < 200) {
        printf("%.100f\n", a);
        a *= 2;
        a = modf(a, &integral);
        if (integral == 1) res[count++] = integral + '0';
        else res[count++] = '0';
    }

    res[count] = '\0';
    return res;
}

int main(void) {
    double wit1 = 1.0000000001;
    double diff = 0.0000000001;
    char* res = convert(wit1);
    char* di = convert(diff);
    printf("this: %s\n", res);
    printf("diff: %s\n", di);
    return 0;
}

Direct output:

this: 1.0000000000000000000000000000000001101101111100111
diff: 0.00000000000000000000000000000000011011011111001101111111011001110101111011110110111011

First question: why there are so many ending zero-ones in the difference? Why do the results after the binary point differ?

However, if we look at the process of calculation and the fractional part, printed out (I'm presenting only the first few lines):

1.0000000001:
0.0000000001000000082740370999090373516082763671875000000000000000000000000000000000000000000000000000
0.0000000002000000165480741998180747032165527343750000000000000000000000000000000000000000000000000000
0.0000000004000000330961483996361494064331054687500000000000000000000000000000000000000000000000000000

0.0000000001:
0.0000000001000000000000000036432197315497741579165547065599639608990401029586791992187500000000000000
0.0000000002000000000000000072864394630995483158331094131199279217980802059173583984375000000000000000
0.0000000004000000000000000145728789261990966316662188262398558435961604118347167968750000000000000000

Second question: why there are so many strange ending numbers? Is this a result of the incapability of the floating-point arithmetic of precisely representing decimal values?

Analyzing the subtraction, I can see, why the result is bigger than the epsilon. I follow the procedure:

Prepare a complement sequence of zero-ones for the sequence to subtract
"Add" the sequences
Subtract the one in the beginning, add it to the rightmost bit

Therefore:

   1.0000000000000000000000000000000001101101111100111
 - 1.0000000000000000000000000000000000000000000000000
               |
              \/
   1.0000000000000000000000000000000001101101111100111
"+"0.1111111111111111111111111111111111111111111111111    
 --------------------------------------------------------
  10.0000000000000000000000000000000001101101111100110    
          |
          \/
   0.0000000000000000000000000000000001101101111100111

Comparing with the calculated value of epsilon:

0.000000000000000000000000000000000110110111110011 0 1111111011001110101111011110110111011
0.000000000000000000000000000000000110110111110011 1

Spaces indicate the difference.

Third question: do I have to worry if I can't compare the value equal to the epsilon? I think that this situation indicates what the interval of tolerance with epsilon has been made for. However, is there anything I should change?

For such investigations, rather than print `double` wirh `"%f"`, use `printf("%a %.16e\n", a, a);` — chux - Reinstate Monica, Aug 27 '21 at 23:15

score 2 · Answer 1 · answered Aug 27 '21 at 23:31

Why do the results after the binary point differ?

Because that is the difference.

Expecting something else comes from thinking 1.0000000001 and 0.0000000001 as double have those 2 values. They do not. Their difference is not 1.0. They have values near those two, each with about 53 binary digits of significance. Their difference is close to the unit in the last place of 1.0000000001.

why there are so many strange ending numbers? Is this a result of the incapability of the floating-point arithmetic of precisely representing decimal values?

Somewhat.
double can encode about 2⁶⁴ different numbers. 1.0000000001 and 0.0000000001 are not in that set. Instead nearby ones are used that look like strange ending numbers.

do I have to worry if I can't compare the value equal to the epsilon? I think that this situation indicates what the interval of tolerance with epsilon has been made for. However, is there anything I should change?

Yes, change use of epsilon. epsilon is useful for the relative difference, not absolute one. Very large consecutive double values are far more than epsilon apart. About 45% of all double, (the small ones) are all less than epsilon in magnitude. Either if (a - b <= EPSILON) printf("a - b <= EPSILON\n"); or if (b - a <= EPSILON) printf("b - a <= EPSILON\n"); will be true for small a, b even though they are trillions of times different in magnitude.

Oversimplification:

if (fabs(a-b) < EPSILON*fabs(a + b)) {
  return values_a_b_are_near_each_other;
}

Eric Postpischil · Answer 2 · 2021-08-27T23:49:09.427

This answer assumes your C implementation uses IEEE-754 binary64, also known as the “double” format for its double type. This is common.

If the C implementation rounds correctly, then double wit1 = 1.0000000001; initializes wit1 to 1.0000000001000000082740370999090373516082763671875. This is because the two representable values nearest 1.0000000001 are 1.000000000099999786229432174877729266881942749023437500000000000000000000000000000000000000000000000 and 1.0000000001000000082740370999090373516082763671875. The latter is chosen since it is closer.

If correctly rounded, the 1e-10 used for EPSILON will produce 0.000000000100000000000000003643219731549774157916554706559963960899040102958679199218750000000000000.

Clearly wit1 - 1 exceeds EPSILON, so the test a - b < EPSILON in double_equal evaluates as false.

First question: why there are so many ending zero-ones in the difference?

Count the number of bits from the first 1 to the last 1. In each number, there are 53. That is because there are 53 bits in the significand of a double. It is a bit of a coincidence your numbers happened to end in a 1 bit. About half the time, the trailing bit is 0, and a quarter of the time, the last two bits are zeros, and so on. However, since there are 53 bits in the significand of a double, there will be exactly 53 bits from the first 1 bit to the last bit that is part of the represented value.

Since your first number starts with a 1 in the integer position, it has at most 52 bits after that. At that point, the number must be rounded to the nearest representable value.

Since your second number is between 2⁻³⁴ and 2⁻³³, its first 1 bit is in the 2⁻³⁴ position, and it can go to the 2⁻⁸⁶ position before it has to be rounded.

Third question: do I have to worry if I can't compare the value equal to the epsilon?

Why do you want to compare to the epsilon? There is no general solution for comparing floating-point numbers that contain errors from previous operations. Whether or not an “epsilon comparison” can or should be used is dependent on the application and the operations and numbers involved.

Floating-point arithemtic in C: epsilon comparison

2 Answers2

Linked