I'm looking for a method to convert the exact value of a floating-point number to a rational quotient of two integers, i.e. a / b, where b is not larger than a specified maximum denominator b_max. If representing the exact value with b <= b_max is impossible, then the result should fall back to the closest approximation that still satisfies b <= b_max.

Hold on. There are a lot of questions/answers here about the best rational approximation of a truncated real number which is represented as a floating-point number. However, I'm interested in the exact value of a floating-point number, which is itself a rational number with a different representation. More specifically, the mathematical set of floating-point numbers is a subset of the rational numbers; in the case of the IEEE 754 binary floating-point standard, it is a subset of the dyadic rationals. Either way, any finite floating-point number can be converted to a rational quotient of two finite-precision integers a / b.

So, for example assuming IEEE 754 single-precision binary floating-point format, the rational equivalent of float f = 1.0f / 3.0f is not 1 / 3, but 11184811 / 33554432. This is the exact value of f, which is a number from the mathematical set of IEEE 754 single-precision binary floating-point numbers.

Based on my experience, traversing the Stern-Brocot tree by binary search is not useful here, since that is more suitable for approximating the value of a floating-point number when it is interpreted as a truncated real rather than as an exact rational.

Possibly, continued fractions are the way to go.

Another problem here is integer overflow. Suppose we want to represent the rational as the quotient of two int32_t, with the maximum denominator b_max = INT32_MAX. We cannot rely on a stopping criterion like b > b_max, because by the time that test runs the intermediate products may already have overflowed. So the algorithm must either never overflow or reliably detect overflow.
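For illustration, a per-step denominator check of the form k[1] * a + k[0] <= b_max can be rearranged into a division so that the check itself never overflows (a sketch only; the function name is mine, not from any of the algorithms below):

```c
#include <stdbool.h>
#include <stdint.h>

/* Would k1 * a + k0 exceed b_max? All arguments are assumed nonnegative,
   with k0 <= b_max. Rearranged as a division, so no intermediate product
   can overflow int64_t. */
static bool denom_step_overflows(int64_t k1, int64_t k0, int64_t a, int64_t b_max)
{
    return a != 0 && k1 > (b_max - k0) / a;
}
```

With such a guard, a continued-fraction loop can stop (or clamp its final coefficient) before ever computing an out-of-range convergent.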

What I found so far is an algorithm from Rosetta Code, which is based on continued fractions, but its source mentions it is "still not quite complete". Some basic tests gave good results, but I cannot confirm its overall correctness and I think it can easily overflow.

// https://rosettacode.org/wiki/Convert_decimal_number_to_rational#C

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <stdint.h>

/* f : number to convert.
 * num, denom: returned parts of the rational.
 * md: max denominator value.  Note that machine floating point number
 *     has a finite resolution (10e-16 ish for 64 bit double), so specifying
 *     a "best match with minimal error" is often wrong, because one can
 *     always just retrieve the significand and return that divided by 
 *     2**52, which is in a sense accurate, but generally not very useful:
 *     1.0/7.0 would be "2573485501354569/18014398509481984", for example.
 */
void rat_approx(double f, int64_t md, int64_t *num, int64_t *denom)
{
    /*  a: continued fraction coefficients. */
    int64_t a, h[3] = { 0, 1, 0 }, k[3] = { 1, 0, 0 };
    int64_t x, d, n = 1;
    int i, neg = 0;

    if (md <= 1) { *denom = 1; *num = (int64_t) f; return; }

    if (f < 0) { neg = 1; f = -f; }

    while (f != floor(f)) { n <<= 1; f *= 2; }
    d = f;

    /* continued fraction and check denominator each step */
    for (i = 0; i < 64; i++) {
        a = n ? d / n : 0;
        if (i && !a) break;

        x = d; d = n; n = x % n;

        x = a;
        if (k[1] * a + k[0] >= md) {
            x = (md - k[0]) / k[1];
            if (x * 2 >= a || k[1] >= md)
                i = 65;
            else
                break;
        }

        h[2] = x * h[1] + h[0]; h[0] = h[1]; h[1] = h[2];
        k[2] = x * k[1] + k[0]; k[0] = k[1]; k[1] = k[2];
    }
    *denom = k[1];
    *num = neg ? -h[1] : h[1];
}
plasmacel
  • http://www.itu.dk/~sestoft/bachelor/IEEE754_article.pdf – πάντα ῥεῖ Jul 02 '18 at 19:02
  • @πάνταῥεῖ I know this article. How does it help here? – plasmacel Jul 02 '18 at 19:03
  • Representation of 2 based numbers can only represent a subset of all possible rational fractions. – πάντα ῥεῖ Jul 02 '18 at 19:04
  • @πάνταῥεῖ Yes, but that subset still contains valid rational fractions. – plasmacel Jul 02 '18 at 19:04
  • Well, if you could have an arbitrary number of bits for the representation you might be able to represent all of them. May be I didn't understand what you actually want to do? – πάντα ῥεῖ Jul 02 '18 at 19:07
  • I said "This is the *exact* value of `f`, which is a number from the mathematical set of IEEE 754 single-precision binary floating-point numbers." The mathematical set of floating-point numbers is a subset of rational numbers. All IEEE 754 floating-point numbers can be represented as a quotient of two integers. There is no need for infinite bits. I already wrote a Wolfram Mathematica function to do that. Unfortunately that is not portable to C++. – plasmacel Jul 02 '18 at 19:07
  • I guess I don’t understand the problem:get the significand, get the exponent: the exact value is the significand shifted left for positive exponent divided by one and otherwise it is the significand divided by 2 shifted by the negative exponent with some small faffing about to account for the digits of the significand. Obviously you’ll need big integers to avoid overflow. – Dietmar Kühl Jul 02 '18 at 19:31
  • You may be interested in Python's [`Fraction.limit_denominator`](https://github.com/python/cpython/blob/087570af6d5d39b51bdd5e660a53903960e58678/Lib/fractions.py#L219), which does exactly this. Of course, that doesn't help with the overflow problem in C, but the algorithm may be useful. Specifically, in Python, for a (Python) `float` `x`, `Fraction(x).limit_denominator(b_max)` gives what you want. – Mark Dickinson Jul 02 '18 at 19:31
  • @MarkDickinson Thanks, I will definitely look into it! – plasmacel Jul 02 '18 at 19:32
  • @DietmarKühl The significand in IEEE 754 is encoded as a binary fraction. Converting that binary fraction to decimal naively is a complete disaster because you have to add `k` rationals with different denominators, i.e. you have to perform a lot of multiplication and addition (not to mention applying the exponent at the end), which leads to overflow very very soon. Of course you could simplify in each step using the greatest common divisor, but that would make the algorithm very slow. This approach also cannot provide the closest approx if the rational cannot be represented as `b <= b_max`. – plasmacel Jul 02 '18 at 19:37
  • The significand is an integer. You'll only need to multiply it by a power of 2 to get the exact value. Clearly, you will need big integers to prevent overflow to represent values with interesting exponents. Converting these to a decimal representation is a separate question and can be done by appropriate divisions by suitable powers of 10 (you can keep using 10 if you are so inclined). Since you asked for an algorithm: that'll do the trick. Since floating points are fixed sized you don't need a variable size implementation, which makes things a bit more effective. – Dietmar Kühl Jul 02 '18 at 19:43
  • The greatest common divisor won't help you much as the denominator is a power of 2. It will at best shift off a few zero bits. You'll need a large integer representation to get the exact value. – Dietmar Kühl Jul 02 '18 at 19:47
  • @DietmarKühl There is no guarantee that the integers used to represent the quotient `a / b` will be arbitrary-precision. If they are not, then I am interested in the best approximation which still satisfies `b <= b_max`, where `b_max` is the largest value of that integer type. This is why an iterative refinement algorithm is desirable. – plasmacel Jul 02 '18 at 19:50
  • You should focus on the algorithm IMO. Precision of calculations can be extended( at the cost of memory and runtime) throughout library stuff such as `boost::multiprecision`. – Red.Wave Jul 02 '18 at 19:32
  • chux has given an answer that may be suitable for IEEE-754 binary formats (I have not examined it yet). For completeness, I will point out that C provides `FLT_RADIX`, which defines the radix used for the floating-point formats. Multiplying or dividing by `FLT_RADIX` should be exact (and note the `scalbn` function), so, given some floating-point value `x` that is not an integer, one can find the desired numerator by multiplying by `FLT_RADIX` until the result is an integer, and the denominator is `FLT_RADIX` to the power of the number of multiplications. – Eric Postpischil Jul 03 '18 at 01:21
  • @EricPostpischil Great approach, but what about denormal numbers? Will this work for them as well? You can also encounter too big numbers during the multiplication by `FLT_RADIX`. – plasmacel Jul 03 '18 at 06:37
  • @plasmacel: Multiplying denormal numbers should work fine. Multiplying by `FLT_RADIX` until a number is an integer will not overflow in any normal format. Theoretically, one could define a floating-point format such that the exponent is limited to a weird range like -20 to -10 instead of a more normal -126 to 127, but, in practice, exponent ranges allow the significand to be scaled to an integer. – Eric Postpischil Jul 03 '18 at 11:55
  • @EricPostpischil I guess this method also guarantees the canonical form of the rational, i.e. it is cannot be further simplified by the greatest common divisor since the GCD is 1. – plasmacel Jul 09 '18 at 08:31
  • @EricPostpischil At least if the radix is a prime number. Otherwise the computed numerator and denominator could be simplified by the greatest common divisor. Think about radix 10. – plasmacel Jul 09 '18 at 15:33

1 Answer

All finite double values are rational numbers, as OP well stated.

Use frexp() to break the number into its fraction and exponent. The end result still needs to use double to represent whole-number values due to range requirements. Some numbers are too small (x smaller than 1.0/pow(2.0, DBL_MAX_EXP)), and infinity and not-a-number are issues.

The frexp functions break a floating-point number into a normalized fraction and an integral power of 2. ... interval [1/2, 1) or zero ...
C11 §7.12.6.4 2/3

#include <stdio.h>
#include <math.h>
#include <float.h>

_Static_assert(FLT_RADIX == 2, "TBD code for non-binary FP");

// Return error flag
int split(double x, double *numerator, double *denominator) {
  if (!isfinite(x)) {
    *numerator = *denominator = 0.0;
    if (x > 0.0) *numerator = 1.0;
    if (x < 0.0) *numerator = -1.0;
    return 1;
  }
  int bdigits = DBL_MANT_DIG;
  int expo;
  *denominator = 1.0;
  *numerator = frexp(x, &expo) * pow(2.0, bdigits);
  expo -= bdigits;
  if (expo > 0) {
    *numerator *= pow(2.0, expo);
  }
  else if (expo < 0) {
    expo = -expo;
    if (expo >= DBL_MAX_EXP-1) {
      *numerator /= pow(2.0, expo - (DBL_MAX_EXP-1));
      *denominator *= pow(2.0, DBL_MAX_EXP-1);
      return fabs(*numerator) < 1.0;
    } else {
      *denominator *= pow(2.0, expo);
    }
  }

  while (*numerator && fmod(*numerator,2) == 0 && fmod(*denominator,2) == 0) {
    *numerator /= 2.0;
    *denominator /= 2.0;
  }
  return 0;
}

void split_test(double x) {
  double numerator, denominator;
  int err = split(x, &numerator, &denominator);
  printf("e:%d x:%24.17g n:%24.17g d:%24.17g q:%24.17g\n", 
      err, x, numerator, denominator, numerator/ denominator);
}

int main(void) {
  volatile float third = 1.0f/3.0f;
  split_test(third);
  split_test(0.0);
  split_test(0.5);
  split_test(1.0);
  split_test(2.0);
  split_test(1.0/7);
  split_test(DBL_TRUE_MIN);
  split_test(DBL_MIN);
  split_test(DBL_MAX);
  return 0;
}

Output

e:0 x:      0.3333333432674408 n:                11184811 d:                33554432 q:      0.3333333432674408
e:0 x:                       0 n:                       0 d:        9007199254740992 q:                       0
e:0 x:                     0.5 n:                       1 d:                       2 q:                     0.5
e:0 x:                       1 n:                       1 d:                       1 q:                       1
e:0 x:                       2 n:                       2 d:                       1 q:                       2
e:0 x:     0.14285714285714285 n:        2573485501354569 d:       18014398509481984 q:     0.14285714285714285
e:1 x: 4.9406564584124654e-324 n:  4.4408920985006262e-16 d: 8.9884656743115795e+307 q: 4.9406564584124654e-324
e:0 x: 2.2250738585072014e-308 n:                       2 d: 8.9884656743115795e+307 q: 2.2250738585072014e-308
e:0 x: 1.7976931348623157e+308 n: 1.7976931348623157e+308 d:                       1 q: 1.7976931348623157e+308

Leave the b_max consideration for later.


More expedient code is possible by replacing pow(2.0, expo) with ldexp(1.0, expo) (per @gammatester) or exp2(expo) (per @Bob__).

The reduction loop while (*numerator && fmod(*numerator,2) == 0 && fmod(*denominator,2) == 0) could also use some performance improvements. But first, let us get the functionality right.

chux - Reinstate Monica
  • Nice! I'll do some tests soon before I accept it. Do you have any idea how to handle "too small" values? – plasmacel Jul 02 '18 at 19:57
  • @plasmacel The limitations and even type of `b_max` warrant further thought. To more precisely render an integer _type_ solution, I'd start with `intmax_t num, uintmax_t den;` for potentially some more range. – chux - Reinstate Monica Jul 02 '18 at 19:59
  • @plasmacel "too small" --> I think the question could change to "integer / power-of-2" and then range issues are far less of a concern (except then how to code infinity and NaN, hmmm) . – chux - Reinstate Monica Jul 02 '18 at 20:02
  • Those too small values, more precisely `1.0 / pow(2, DBL_MAX_EXP)` are actually denormal I think. – plasmacel Jul 02 '18 at 20:04
  • @plasmacel "too small values" are de-normal (or sub-normal) for [binary64](https://en.wikipedia.org/wiki/Double-precision_floating-point_format), yet this answer does not assume that format as C does not specify that. It is just the most common format and is true there. – chux - Reinstate Monica Jul 02 '18 at 20:06
  • Is it possible for https://en.cppreference.com/w/c/numeric/math/exp2 to be implemented more efficiently or with more accuracy then a generic `pow(2.0,x)`? – Bob__ Jul 02 '18 at 20:15
  • @gammatester `ldexp()` would be useful for the inverse operation, merging the whole number numerator and the log2(denominator) to get back the original `x`. – chux - Reinstate Monica Jul 02 '18 at 20:24
  • `pow(2, exp) == ldexp(1, exp)` – gammatester Jul 02 '18 at 20:26
  • @Bob__ Yes `exp2()` would be better in general. A good compile will analyze code such as `pow(2.0, expo)` and emit the same code as `exp2(expo)`. – chux - Reinstate Monica Jul 02 '18 at 20:33
  • @gammatester Yes, good idea to use `ldexp(1.0, expo)` for `pow(2, expo)`. – chux - Reinstate Monica Jul 02 '18 at 20:40
  • @chux If I understand well, numbers smaller than `1.0 / pow(2, MAX_EXP)` (which are "too small") result in a numerator which is a negative power of `2.0`. Such cases could be resolved by multiplying both the numerator and denominator by `1.0 / numerator`. Or do that cases already have the maximum representable denominator? – plasmacel Jul 02 '18 at 20:48
  • @plasmacel Yes. Using `double` for the denominator, it can only get so big before "infinity". Thus the test `if (expo >= DBL_MAX_EXP-1)`. Perhaps this code could be refined a bit here, but the point being at some point, `denominator` is limited to how high it can go. Yes, we hit the maximum representable power-of-2 denominator. – chux - Reinstate Monica Jul 02 '18 at 20:51
  • @chux Then this is not a real issue here. I assume that if a fixed-precision integer type is used instead of an arbitrary-precision, then converting this double to that type (clamping to the maximum value of the fixed integer type) gives the best possible approximation for that type. If an arbitrary-precision integer type is used instead, then a workaround can be easily made for this issue. – plasmacel Jul 02 '18 at 20:56
  • @chux Is my presumption true that any floating-point number can be represented by a quotient of two floating-point numbers of the same type? And an another optimization idea: wouldn't be division by the greatest common divisor (GCD, implemented by the Euclidean algorithm) be more efficient instead of the while loop at the end? I think it would converge faster. – plasmacel Jul 02 '18 at 21:20
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/174194/discussion-between-chux-and-plasmacel). – chux - Reinstate Monica Jul 02 '18 at 21:41
  • In C++ code, I suggest replacing DBL_MANT_DIG by `std::numeric_limits<double>::digits` and DBL_MAX_EXP by `std::numeric_limits<double>::max_exponent` – Tomáš Kolárik Jan 22 '23 at 14:51
  • In C++ code, consider using `boost::rational`, which normalizes numerator and denominator within its constructor (the double values must be explicitly converted to `long`, though), thus there is no need for `while (*numerator && ...` – Tomáš Kolárik Jan 22 '23 at 14:54
  • @TomášKolárik Although `std::numeric_limits::digits` would make for a better C++ answer, OP, using `<*.h>` files seems more interested in a C-like answer. Feel open to post a more idiomatic C++ answer. `boost::` is not part of standard C++ AFAIK, so its inclusion drifts way from standard C++. Again, you could answer with your keen knowledge of `boost`. – chux - Reinstate Monica Jan 23 '23 at 04:43