1

I need to subtract extremely small double number x from 1 i.e. to calculate 1-x in C++ for 0<x<1e-16. Because of machine precision restrictions for small enoughf x I will always get 1-x=1. Simple solution is to switch from double to some more precise format like long. But because of some restrictions I can't switch to more precise number formats.

What is the most efficient way to get accurate value of 1-x where x is an extremely small double if I can't use more precise formats and I need to store the result of the subtraction as a double? In practice I would like to avoid percentage errors greater then 1% (between double representation of 1-x and its actual value).

P.S. I am using Rcpp to calculate the quantiles of standard normal distribution via qnorm function. This function is symmetric around 0.5 being much more accurate for values close to 0. Therefore instead of qnorm(1-(1e-30)) I would like to calculate -qnorm(1e-30) but to derive 1e-30 from 1-(1e-30) I need to deal with a precision problem. The restriction on double is due to the fact that as I know it is not safe to use more precise numeric formats in Rcpp. Note that my inputs to qnorm could be sought of exogeneous so I can't to derive 1-x from x durning some preliminary calculations.

Bogdan
  • 864
  • 10
  • 18
  • 3
    "But because of some restrictions I can't switch to more precise number formats." -- Would it be possible for you to use an [arbitrary precision arithmetic library](https://en.wikipedia.org/wiki/List_of_arbitrary-precision_arithmetic_software#Libraries), such as [GMP](https://en.wikipedia.org/wiki/GNU_Multiple_Precision_Arithmetic_Library)? Or do you have to use `double`? – Andreas Wenzel Apr 18 '22 at 14:06
  • 4
    Well, you seem to be asking how to store data with more precision without using a data type that allows for more precision to be stored. You obviously can't have both. – Igor Tandetnik Apr 18 '22 at 14:09
  • 1
    How and what you do to get round this would depend, to a great extent, on what you want to do with the resultant `1-x` value. There *may* be a way to rearrange your algebra to avoid having to actually use such a number. Also, even if you could use a `long double` or other extended-precision format, there will likely be cases that still fall foul of machine limitations. – Adrian Mole Apr 18 '22 at 14:10
  • 1
    The relative error between `(1-x)` and `1` is within 1% whenever `x < 0.01` . So it sounds like your result is already within your accepted tolerance. – Igor Tandetnik Apr 18 '22 at 14:12
  • If what you really want is to compute `1-x1-x2-x3-x4...`, where `x1,x2,x3,x4...` are very small values, you should just compute `1-(x1+x2+x3+x4...)`. In that case, precision errors do not accumulate. – prapin Apr 18 '22 at 14:13
  • Make the numbers bigger then. – Taekahn Apr 18 '22 at 14:16
  • 2
    Find a way to rearrange your formula. There are many "schoolbook" mathematics formulas that will not work correctly on a machine that works with binary (like your computer), thus alternatives have been developed that are more friendly to binary. [Here is an example of the quadratic formula](https://www.johndcook.com/blog/2018/04/28/quadratic-formula/) – PaulMcKenzie Apr 18 '22 at 14:19
  • 1
    Awesome comments in just 14 minutes. Rcpp angle here: you are somewhat constrained to using `double` as that is what R uses. Also note that `qnorm()` has arguments for tail values, and using logs. – Dirk Eddelbuettel Apr 18 '22 at 14:20
  • `double` is what *everything* uses. – Blindy Apr 18 '22 at 14:25
  • So you have a `double` argument as external input and it can be an *epsilon* smaller than `1`. Then the precision is already lost before you start. – Sebastian Apr 18 '22 at 16:25

2 Answers2

3

Simple solution is to switch from double to some more precise format like long [presumably, double]

In that case you have no solution. long double is an alias for double on all modern machines. I stand corrected, gcc and icc still support it, only cl has dropped support for it for a long time.

So you have two solutions, and they're not mutually exclusive:

  1. Use an arbitrary precision library instead of the built-in types. They're orders of magnitude slower, but if that's the best your algorithm can work with then that's that.

  2. Use a better algorithm, or at least rearrange your equation variables, to not have this need in the first place. Use distribution and cancellation rules to avoid the problem entirely. Without a more in depth description of your problem we can't help you, but I can tell you with certainty that double is more than enough to allow us to model airplane AI and flight parameters anywhere in the world.

Blindy
  • 65,249
  • 10
  • 91
  • 131
  • 1
    *long double is an alias for double on all modern machines.* Really? [For example](https://stackoverflow.com/a/14221894/10871073). – Adrian Mole Apr 18 '22 at 14:25
  • *In that case you have no solution. long double is an alias for double on all modern machines.* Is it? https://godbolt.org/z/ex6b39xfq – NathanOliver Apr 18 '22 at 14:27
  • ... and there are numerous software/library implementations of extended-precision `double double` and such like. – Adrian Mole Apr 18 '22 at 14:28
  • Perhaps a more precise statement would be that [Visual C++ aliases long double to double.](https://godbolt.org/z/3M14bnzre) – Blindy Apr 18 '22 at 14:30
  • 1
    Precision is important in Stack Overflow answers (forgive the pun). – Adrian Mole Apr 18 '22 at 14:30
  • Yeah I was honestly unaware that `gcc` still had it, I've been using the MS stack exclusively in my work. – Blindy Apr 18 '22 at 14:32
  • Yeah I mentioned it in my edit. Only `cl` dropped support for it in favor of SSE*/AVX. – Blindy Apr 18 '22 at 14:35
  • As for "real-world" cases where `double` isn't good enough: Think deep-zoom Mandelbrot Sets (and other fractal displays). – Adrian Mole Apr 18 '22 at 14:35
  • There are plenty of ways to need it, my point was that, without any additional information from OP and the law of averages, his case is probably not one of those. The precision requirements are *probably* unoptimized equations. – Blindy Apr 18 '22 at 14:38
2

Rather than resorting to an arbitrary precision solution (which, as others have said, would potentially be extremely slow), you could simply create a class that extends the inherent precision of the double type by a factor of (approximately) two. You would then only need to implement the operations that you actually need: in your case, this may only be subtraction (and possibly addition), which are both reasonably easy to achieve. Such code will still be considerably slower than using native types, but likely much faster than libraries that use unnecessary precision.

Such an implementation is available (as open-source) in the QD_Real class, created some time ago by Yozo Hida (a PhD Student, at the time, I believe).

The linked repository contains a lot of code, much of which is likely unnecessary for your use-case. Below, I have shown an extremely trimmed-down version, which allows creation of data with the required precision, shows an implementation of the required operator-() and a test case.

#include <iostream>

class ddreal {
private:
    static inline double Plus2(double a, double b, double& err) {
        double s = a + b;
        double bb = s - a;
        err = (a - (s - bb)) + (b - bb);
        return s;
    }
    static inline void Plus3(double& a, double& b, double& c) {
        double t3, t2, t1 = Plus2(a, b, t2);
        a = Plus2(c, t1, t3);
        b = Plus2(t2, t3, c);
    }
public:
    double x[2];
    ddreal() { x[0] = x[1] = 0.0; }
    ddreal(double hi) { x[0] = hi; x[1] = 0.0; }
    ddreal(double hi, double lo) { x[0] = Plus2(hi, lo, x[1]); }
    ddreal& operator -= (ddreal const& b) {
        double t1, t2, s2;
        x[0] = Plus2(x[0], -b.x[0], s2);
        t1 = Plus2(x[1], -b.x[1], t2);
        x[1] = Plus2(s2, t1, t1);
        t1 += t2;
        Plus3(x[0], x[1], t1);
        return *this;
    }
    inline double toDouble() const { return x[0] + x[1]; }
};

inline ddreal operator-(ddreal const& a, ddreal const& b)
{
    ddreal retval = a;
    return retval -= b;
}

int main()
{
    double sdone{ 1.0 };
    double sdwee{ 1.0e-42 };
    double sdval = sdone - sdwee;
    double sdans = sdone - sdval;
    std::cout << sdans << "\n"; // Gives zero, as expected

    ddreal ddone{ 1.0 };
    ddreal ddwee{ 1.0e-42 };
    ddreal ddval = ddone - ddwee; // Can actually hold 1 - 1.0e42 ...
    ddreal ddans = ddone - ddval;
    std::cout << ddans.toDouble() << "\n"; // Gives 1.0e-42

    ddreal ddalt{ 1.0, -1.0e-42 }; // Alternative initialization ...
    ddreal ddsec = ddone - ddalt;
    std::cout << ddsec.toDouble() << "\n"; // Gives 1.0e-42

    return 0;
}

Note that I have deliberately neglected error-checking and other overheads that would be needed for a more general implementation. Also, the code I have shown has been 'tweaked' to work more optimally on x86/x64 CPUs, so you may need to delve into the code at the linked GitHub, if you need support for other platforms. (However, I think the code I have shown will work for any platform that conforms strictly to the IEEE-754 Standard.)

I have tested this implementation, extensively, in code I use to generate and display the Mandelbrot Set (and related fractals) at very deep zoom levels, where use of the raw double type fails completely.

Note that, though you may be tempted to 'optimize' some of the seemingly pointless operations, doing so will break the system. Also, this must be compiled using the /fp:precise (or /fp:strict) flags (with MSVC), or the equivalent(s) for other compilers; using /fp:fast will break the code, completely.

Adrian Mole
  • 49,934
  • 160
  • 51
  • 83