Get the digits of a long double

Question

I am trying to implement a simple version of a function similar to dtoa but without arbitrary precision. I name it ftob which also handles arithmetic bases other than 10 (2-36). It works with long double, which on my machine is : x86 extended pricision

The function works fine, but on certain values it gives erroneous and unacceptable results, e.g. 2.5600 ,

here is my code:

#include<stdio.h>
#include<math.h>

char *ftob(long double ld, char *str, unsigned short int n_digits, unsigned short int base)
{
    long double ftob_tmp;
    short unsigned index = 1;
    short const sign = (ld < 0.0L) ? -1 : 1;
    short int i = 0, j = 0 , k = 0;

    //check base, number.
    if(base < 2 || base > 36)
        return NULL;
    else if(ld == INFINITY) {
        str = "inf";
        return str;
    }
    else if(ld == -INFINITY) {
        str = "-inf";
        return str;
    }
    else if(__isnanl(ld)) {
        str = "nan";
        return str;
    }
    //initialisations
    (sign == -1) ? str[i++] = '-' : 0;
    ftob_tmp = sign * ld;
    while(ftob_tmp > 0.0L && ftob_tmp < 1.0L) {
        ftob_tmp *= base;
        j++;
    }
    while(ftob_tmp >= base) {
        ftob_tmp /= base;
        j--;
    }
    //reinitialise
    ftob_tmp = sign * ld;
    if(ftob_tmp >= 0.0L && ftob_tmp < 1.0L) {
        str[i++] = '0';
        str[i++] = '.';
        for(k = 0; k < j - 1 && k < n_digits - 1; k++)
            str[i++] = '0';
        n_digits -= j;
    }
    else if(ftob_tmp >= base)
        k = i - j + 1;
    else
        k = i + 1;
    ftob_tmp *= powl(base, --j);
//  printf("%0.20Lf\n", ftob_tmp); /*debug message*/

    //main loop
    for(n_digits += i; i < n_digits; i++) {
        if(i == k)
            str[n_digits++, i] = '.';
        else {
//          printf("%0.20Lf * %Lf = %0.20Lf\n", ftob_tmp, powl(base, index), ftob_tmp * powl(base, index)); /* debug message*/
            str[i] = (int)fmodl((ftob_tmp * powl(base, index++)), base);
            str[i] += (str[i] < 10) ? '0' : 'A' - 10;
        }
    }

    //finalise
    str[i] = '\0';

    return str;
}

int main(void)
{
    char ftl[300];

    printf("ftl = \"%s\"\n", ftob(2.56L, ftl, 19, 10));
    return 0;
}

the output for ftob(2.56L, ftl, 19, 10) is:

ftl = "2.550990990990990990"

Uncommenting the debug message gives:

0.25599999999999999999
0.25599999999999999999 * 10.000000 = 2.55999999999999999995
0.25599999999999999999 * 100.000000 = 25.59999999999999999861
0.25599999999999999999 * 1000.000000 = 255.99999999999999998612
0.25599999999999999999 * 10000.000000 = 2560.00000000000000000000
0.25599999999999999999 * 100000.000000 = 25599.99999999999999822364
0.25599999999999999999 * 1000000.000000 = 255999.99999999999998578915
0.25599999999999999999 * 10000000.000000 = 2560000.00000000000000000000
0.25599999999999999999 * 100000000.000000 = 25599999.99999999999818101060
0.25599999999999999999 * 1000000000.000000 = 255999999.99999999998544808477
0.25599999999999999999 * 10000000000.000000 = 2560000000.00000000000000000000
0.25599999999999999999 * 100000000000.000000 = 25599999999.99999999813735485077
0.25599999999999999999 * 1000000000000.000000 = 255999999999.99999998509883880615
0.25599999999999999999 * 10000000000000.000000 = 2560000000000.00000000000000000000
0.25599999999999999999 * 100000000000000.000000 = 25599999999999.99999809265136718750
0.25599999999999999999 * 1000000000000000.000000 = 255999999999999.99998474121093750000
0.25599999999999999999 * 10000000000000000.000000 = 2560000000000000.00000000000000000000
0.25599999999999999999 * 100000000000000000.000000 = 25599999999999999.99804687500000000000
0.25599999999999999999 * 1000000000000000000.000000 = 255999999999999999.98437500000000000000
0.25599999999999999999 * 10000000000000000000.000000 = 2560000000000000000.00000000000000000000
ftl = "2.550990990990990990"

The source of the error seems to be that 0.256 cannot be represented exactly in long double and has a value about 0.255999999999999999989374818710.
But i am okay if I get output:

flt = "2.5599999999999999999"

instead of :

flt = "2.5600000000000000000"

The problem is that in the "main loop" in the fourth round it is arbitrarily rounded to 2560.00000 which causes str[i] to be set to 0 instead of 9. This is also because 2559.99999999999999... cannot be represented in long double.
But I only need '2559' to be representable so str[i] can be set to 9. (and so for every round in the loop).

I request advice on how I can achieve this, or if it is achievable at all.

Thanks in Advance,

Lines like `str = "inf"; return str;` should be `return strcpy(str, "inf");` if you want to actually place the contents of the string literal in the destination buffer provided. — Oka, Jun 15 '21 at 16:48
If, as you say, `0.256` cannot be exactly represented, then it is not reasonable to suppose that it *can* be recovered, and not some other nearby value that also cannot be exactly represented. Floating point representations are a kind of bargain: sacrifice precision for range. — Weather Vane, Jun 15 '21 at 17:52
@Clifford: “Without arbitrary precision” means the code is written with arithmetic precision fixed at compile time or earlier, rather than code that can dynamically allocate and use memory as needed to perform arithmetic to a precision determined at run time. — Eric Postpischil, Jun 15 '21 at 20:19
@EricPostpischil : I know what it means - it what rhetorical; I don't see how the term applies to this question. I was asking what he OP thinks it means in this context - you giving the text book definition does not help clarify. The precision of `long double` is not arbitrary, it is fixed. In this case it is clearly using the 80-bit x86 FPU register length despite occupying 16 bytes (128 bits). Using `__float128` forces true 128bit FP. There are only 18 significant digits in this `long double`. Extracting more serves no purpose. — Clifford, Jun 15 '21 at 20:45
@Clifford I believe "similar to dtoa but without arbitrary precision" meant, "I am trying to write something like the well-known dtoa code, but without using arbitrary precision, as it does". See [this question](https://stackoverflow.com/questions/3173056/). — Steve Summit, Jun 15 '21 at 21:04
@SteveSummit thanks. Not that "well known" clearly ;-). Better perhaps to just state requirements. — Clifford, Jun 15 '21 at 21:49
@Clifford: One approach to writing code to convert between bases is simple; it is taught in elementary school. One does arithmetic in one of the bases (e.g., binary in a computer) and multiplies/divides by the other base to interpret or extract digits of it. The cost for this algorithmic simplicity is one has to accept whatever number of digits the algorithm involves (more than just the significant digits when floating-point exponents are involved). That requires either arbitrary precision or precalculating a large amount of digits that may be required to span all possibilities… — Eric Postpischil, Jun 15 '21 at 23:40
… The other approach is to use advanced algorithms like Dragon4 or the Grisu algorithms. These are still the subject of ongoing research after decades. They are fast and use limited precision and take some work to implement. (They take some work just to read and understand the papers presenting them.) So, in this context, a request to do the conversion without arbitrary precision is a request to use some advanced algorithm. Which is generally out of the scope of Stack Overflow, due to the amount of explanation required. — Eric Postpischil, Jun 15 '21 at 23:43
@EricPostpischil I appreciate the effort, but you may have over explained that. Perhaps if I were familiar with the dtoa() mentioned but would have been clear. I cannot find documentation of such a function. I am not sure it suffices as an explanation of what the code is intended to do. The title "get digits of long double" seems clear, and long double is finite, so I see no need for arbitrary precision. The number of useful digits in any base is calculable (as in my second answer). — Clifford, Jun 16 '21 at 06:17
@Clifford: `dtoa` is a [Netlib](http://www.netlib.org/fp/index.html) routine that does binary (floating-point) to decimal conversion, by David Gay of the well-known [seminal paper](http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.4049) [*Correctly Rounded Binary-Decimal and Decimal-Binary Conversions*](https://ampl.com/REFS/rounding.pdf). As I noted, while a fixed number of digits can be calculated even for the easy algorithm, this is wasteful; implementing an efficient algorithm is hard, and the limited calculations in your second answer are insufficient. — Eric Postpischil, Jun 16 '21 at 08:56
@EricPostpischil you keep saying "well known" :-). I am sure the answer is flawed, but your example does not demonstrate a flaw, and is an artefact if the true precision of the implementation being 80 bit. I am struggling to see a purpose in rendering more digits than the precision justifies. Probably off topic however. It is clearly a matter of understanding the requirements and to do that perhaps knowledge of dtoa is required which I neither have nor have cause to acquire. I'll shut up; it is clearly not in my domain of expertise. — Clifford, Jun 16 '21 at 10:25
@Clifford The bottom line, I think, is that floating-point base conversions are a "hard" problem. Not "hard" as in "NP-complete", but rather, as in "looks easy, everyone imagines they can do it pretty straightforwardly, but doing it *correctly* is anything but straightforward". For most of us, it's pick two or three of: (1) easy to write and understand, (2) delivers correctly-rounded result in *every* case, (3) uses fixed precision, and (4) is maximally efficient. To simultaneously achieve all four is nearly impossible, and is what makes a *magnum opus* out of Dragon and dtoa. — Steve Summit, Jun 16 '21 at 11:23
As I understand it, the real killer is (2). It's not *too* hard to write some float-to-string or string-to-float code that delivers decently-rounded results most of the time, but doing so *all* the time is surprisingly hard, and even more surprisingly can end up requiring arbitrary precision. — Steve Summit, Jun 16 '21 at 11:26
@Clifford My advice to you is: know what you don't know. :-) If you don't understand Dragon and/or dtoa in detail (and I certainly don't!), there's no shame in that, *but* you're almost certainly not capable of writing one of these functions that achieves (2). There's a decent explanation of some of the problems [here](https://www.exploringbinary.com/quick-and-dirty-decimal-to-floating-point-conversion/) and [here](https://www.exploringbinary.com/quick-and-dirty-floating-point-to-decimal-conversion/). — Steve Summit, Jun 16 '21 at 11:32
@Steve Summit, I too don't understand dtoa in detail, I only know that it converts a double to a string and that's about it. as you say in (2), I think I just hit upon a value that didn't go well with my simple implementation. — Siddharth Bhat, Jun 17 '21 at 12:40

chux - Reinstate Monica · Accepted Answer · 2021-06-15T21:16:27.463

Rounding error amplified by mod

ftob_tmp * powl(...) product likely needs to round to nearest long double and so is not the exact math result. This rounded product is then modded and sometimes returns 0 or 9 as it is on or just under an integer value for the later digits of 0.255999999999999999999.

//                  v- rounding introduced error -v
str[i] = (int)fmodl((ftob_tmp * powl(base, index++)), base);
//            ^-- error magnified -----------------^

With more debug info, one can see the sometimes 0, sometimes 9 when only 9 was expected.

printf("bbb %0.20Lf * %Lf = %0.20Lf  %d\n", 
    ftob_tmp, powl(base, index), ftob_tmp * powl(base, index), 
    (int) fmodl((ftob_tmp * powl(base, index++)), base));

bbb 0.25599999999999999999 * 100.000000 = 25.59999999999999999861  2
bbb 0.25599999999999999999 * 10000.000000 = 2560.00000000000000000000  5
bbb 0.25599999999999999999 * 1000000.000000 = 255999.99999999999998578915  9
bbb 0.25599999999999999999 * 100000000.000000 = 25599999.99999999999818101060  0
bbb 0.25599999999999999999 * 10000000000.000000 = 2560000000.00000000000000000000  9
bbb 0.25599999999999999999 * 1000000000000.000000 = 255999999999.99999998509883880615  9
bbb 0.25599999999999999999 * 100000000000000.000000 = 25599999999999.99999809265136718750  0
bbb 0.25599999999999999999 * 10000000000000000.000000 = 2560000000000000.00000000000000000000  9
...

how I can achieve this, or if it is achievable at all (?)

Yes, achievable but not with OP's approach as too much error is injected in various steps. These corner cases are quite difficult and usual oblige an wide or extended integer computation instead of floating point.

Example code to print a double in base 10 exactly may help.

Other lesser issues

More rounding error

The loops with ftob_tmp *= base and ftob_tmp /= base each inject up to a 0.5 ULP error. These loops can then form an off-by-one j calculation.

-0.0

Test the sign, not the value, else -0.0 will print as 0.0.

// sign = (ld < 0.0L) ? -1 : 1;
sign = signbit(ld) ? -1 : 1;

String size

char ftl[300]; is insufficient for LDBL_MAX in base 2. Look to LDBL_MAX_EXP, LDBL_MIN_EXP to help determine minimum max string size.

Get the digits of a long double

1 Answers1