What's the first double that deviates from its corresponding long by delta?

Question

I want to know the first double from 0d upwards that deviates by the long of the "same value" by some delta, say 1e-8. I'm failing here though. I'm trying to do this in C although I usually use managed languages, just in case. Please help.


#include <stdio.h>
#include <limits.h>
#define DELTA 1e-8

int main() {
    double d = 0; // checked, the literal is fine
    long i;
    for (i = 0L; i < LONG_MAX; i++) {
         d=i; // gcc does the cast right, i checked
         if (d-i > DELTA || d-i < -DELTA) {
              printf("%f", d);
              break;
         }
    }
}

I'm guessing that the issue is that d-i casts i to double and therefore d==i and then the difference is always 0. How else can I detect this properly -- I'd prefer fun C casting over comparing strings, which would take forever.

ANSWER: is exactly as we expected. 2^53+1 = 9007199254740993 is the first point of difference according to standard C/UNIX/POSIX tools. Thanks much to pax for his program. And I guess mathematics wins again.

The code above seems to be sweeping through every integer. I would consider a binary search instead. It should converge withing 53 or so iterations, rather than 2^53. — Andrej Panjkov, May 17 '09 at 02:07
Binary search without some a priori knowledge of representations would not work. **All** powers of two are representable exactly as double, so if your binary search happened to walk powers of two, it might completely miss finding the desired point... :-) — R.. GitHub STOP HELPING ICE, Jan 29 '12 at 02:18
possible duplicate of [Which is the first integer that an IEEE 754 float is incapable of representing exactly?](http://stackoverflow.com/questions/3793838/which-is-the-first-integer-that-an-ieee-754-float-is-incapable-of-representing-e) — Andrew Mao, Mar 26 '13 at 17:20
@R Only powers of two up to 2^1023 (inclusive) are representable exactly as double. :-) — gerrit, Mar 13 '17 at 16:47
Does this answer your question? [Which is the first integer that an IEEE 754 float is incapable of representing exactly?](https://stackoverflow.com/questions/3793838/which-is-the-first-integer-that-an-ieee-754-float-is-incapable-of-representing-e) — phuclv, Mar 07 '23 at 02:39
"And I guess mathematics wins again." - mathematics *always* wins, especially over physics :-) — paxdiablo, Mar 07 '23 at 02:39

paxdiablo · Accepted Answer · 2023-03-07T02:26:13.053

Doubles in IEE-754 have a precision of 52 bits which means they can store numbers accurately up to (at least) 2⁵¹.

If your longs are 32-bit, they will only have the (positive) range 0 .. 2³¹ - 1 so there is no 32-bit long that cannot be represented exactly as a double. For a 64-bit long, it will be (roughly) 2⁵² so I'd be starting around there, not at zero.

You can use the following program to detect where the failures start to occur. An earlier version I had relied on the fact that the last digit in a number that continuously doubles follows the sequence {2,4,8,6}. However, I opted eventually to use a known trusted tool (bc) for checking the whole number, not just the last digit.

Keep in mind that this may be affected by the actions of sprintf() rather than the real accuracy of doubles (I don't think so personally since it had no troubles with certain numbers up to 2¹⁴³).

This is the test program I wrote:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main() {
    FILE *fin;
    double d = 3.0; // 2^n + 1 to avoid exact powers of 2.
    int i = 1;
    char ds[1000];
    char tst[1000];

    // Loop forever, rely on break to finish.

    while (1) {
        // Get C version of the double.

        sprintf (ds, "%.0f", d);

        // Get bc version of the double.

        sprintf (tst, "echo '2^%d - 1' | bc >tmpfile", i);
        system(tst);
        fin = fopen ("tmpfile", "r");
        fgets (tst, sizeof (tst), fin);
        fclose (fin);
        tst[strlen (tst) - 1] = '\0';

        // Check them.

        if (strcmp (ds, tst) != 0) {
            printf( "2^%d + 1 <-- bc failure\n", i);
            printf( "   got       [%s]\n", ds);
            printf( "   expected  [%s]\n", tst);
            break;
        }

        // Output for status then move to next.

        printf( "2^%d + 1 = %s\n", i, ds);
        d = (d - 1) * 2 + 1;  // Again, 2^n + 1.
        i++;
    }
}

This keeps going until:

2^49 + 1 = 562949953421313
2^50 + 1 = 1125899906842625
2^51 + 1 = 2251799813685249
2^52 + 1 = 4503599627370497
2^53 + 1 <-- bc failure
   got       [9007199254740992]
   expected  [9007199254740993]

which is roughly about where I'd expect it to fail.

As an aside, I originally used numbers of the form 2ⁿ but that got me all the way up to:

2^136 = 87112285931760246646623899502532662132736
2^137 = 174224571863520493293247799005065324265472
2^138 = 348449143727040986586495598010130648530944
2^139 = 696898287454081973172991196020261297061888
2^140 = 1393796574908163946345982392040522594123776
2^141 = 2787593149816327892691964784081045188247552
2^142 = 5575186299632655785383929568162090376495104
2^143 <-- bc failure
   got       [11150372599265311570767859136324180752990210]
   expected  [11150372599265311570767859136324180752990208]

with the size of a double being 8 bytes (checked with sizeof). It turned out these numbers were of the binary form 1000...000, which can be represented for far longer with doubles. That's when I switched to using 2ⁿ + 1 to get a better bit pattern (one at the start and one at the end).

Now, just to be clear, the most reliable way would be to check every number to see which one fails first, but that's going to have quite a long runtime. This approach is the best one given knowledge of the IEEE-754 encodings.

Concise, and you also figured out why my program would never possibly work. Not just the casting, but rather the fact that long is only 32-bit over here. Maybe C truly is the stone age, and I'm not going back. — Overflown, Apr 09 '09 at 03:59
Thanks a lot for adding to this problem. I figured you have to use strings, there's no other way to test at a precision greater than what you are working with. — Overflown, Apr 09 '09 at 05:43
I got 2^53 + 1 (9007199254740993) cast as a double then cast back to long resulted in a different value 9007199254740992. — karmakaze, Nov 16 '11 at 18:38
@karmakaze: good point, I've modified the program to do `2^n+1` rather than `2^n-1`. That finds it at a lower value. — paxdiablo, Mar 07 '23 at 02:27

score 2 · Answer 2 · answered Apr 09 '09 at 02:59

2

The first long to be 'wrong' when cast to a double will not be off by 1e-8, it will be off by 1. As long as the double can fit the long in its significand, it will represent it accurately.

I forget exactly how many bits a double has for precision vs offset, but that would tell you the max size it could represent. The first long to be wrong should have the binary form 10000..., so you can find it much quicker by starting at 1 and left-shifting.

Wikipedia says 52 bits in the significand, not counting the implicit starting 1. That should mean the first long to be cast to a different value is 2^53.

answered Apr 09 '09 at 02:59

Craig Gidney

17,763
5
68
136

I like the math idea from wikipedia, I was just trying to use evidence. – Overflown Apr 09 '09 at 03:13
Actually, the first long to be wrong will have a bit pattern `100..001`. With `100..000`, you just have to switch up to the next power of two. Having a `1` at both "ends" means you can't do that without losing the least significant `1`. – paxdiablo Mar 07 '23 at 02:42

score 1 · Answer 3 · answered May 17 '09 at 02:10

1

Although I'm hesitant to mention Fortran 95 and successors in this discussion, I'll mention that Fortran since the 1990 standard has offered a SPACING intrinsic function which tells you what the difference between representable REALs are about a given REAL. You could do a binary search on this, stopping when SPACING(X) > DELTA. For compilers that use the same floating point model as the one you are interested in (likely to be the IEEE754 standard), you should get the same results.

answered May 17 '09 at 02:10

Andrej Panjkov

1,448
2
13
17

In C++11 and later we have `double std::nextafter(double x, long double y)` for the next representable number after `x` in direction of `y`. Also, in lower precision in the second argument: `double nextafter (double x, double y );`. That's not quite the same as spacing though. – emsr Dec 14 '16 at 20:55

score 0 · Answer 4 · answered Apr 09 '09 at 02:58

0

Off hand, I thought that doubles could represent all integers (within their bounds) exactly.

If that is not the case, then you're going to want to cast both i and d to something with MORE precision than either of them. Perhaps a long double will work.

answered Apr 09 '09 at 02:58

Bill Lynch

80,138
16
128
173

I guess you mean "integers representable as int" will be exactly representable as doubles. This is true when the number of mantissa digits in a double is greater than the number of digits in the int. It's worth remembering that at high exponent values, the distance between representable floating point numbers can exceed 1, so that not all integers are exactly representable in floating point. – Andrej Panjkov May 17 '09 at 02:05

What's the first double that deviates from its corresponding long by delta?

4 Answers4

Linked

Related