Rather than resorting to an arbitrary precision solution (which, as others have said, would potentially be extremely slow), you could simply create a class that extends the inherent precision of the double
type by a factor of (approximately) two. You would then only need to implement the operations that you actually need: in your case, this may only be subtraction (and possibly addition), which are both reasonably easy to achieve. Such code will still be considerably slower than using native types, but likely much faster than libraries that use unnecessary precision.
Such an implementation is available (as open-source) in the QD_Real class, created some time ago by Yozo Hida (a PhD Student, at the time, I believe).
The linked repository contains a lot of code, much of which is likely unnecessary for your use-case. Below, I have shown an extremely trimmed-down version, which allows creation of data with the required precision, shows an implementation of the required operator-()
and a test case.
#include <iostream>
class ddreal {
private:
static inline double Plus2(double a, double b, double& err) {
double s = a + b;
double bb = s - a;
err = (a - (s - bb)) + (b - bb);
return s;
}
static inline void Plus3(double& a, double& b, double& c) {
double t3, t2, t1 = Plus2(a, b, t2);
a = Plus2(c, t1, t3);
b = Plus2(t2, t3, c);
}
public:
double x[2];
ddreal() { x[0] = x[1] = 0.0; }
ddreal(double hi) { x[0] = hi; x[1] = 0.0; }
ddreal(double hi, double lo) { x[0] = Plus2(hi, lo, x[1]); }
ddreal& operator -= (ddreal const& b) {
double t1, t2, s2;
x[0] = Plus2(x[0], -b.x[0], s2);
t1 = Plus2(x[1], -b.x[1], t2);
x[1] = Plus2(s2, t1, t1);
t1 += t2;
Plus3(x[0], x[1], t1);
return *this;
}
inline double toDouble() const { return x[0] + x[1]; }
};
inline ddreal operator-(ddreal const& a, ddreal const& b)
{
ddreal retval = a;
return retval -= b;
}
int main()
{
double sdone{ 1.0 };
double sdwee{ 1.0e-42 };
double sdval = sdone - sdwee;
double sdans = sdone - sdval;
std::cout << sdans << "\n"; // Gives zero, as expected
ddreal ddone{ 1.0 };
ddreal ddwee{ 1.0e-42 };
ddreal ddval = ddone - ddwee; // Can actually hold 1 - 1.0e42 ...
ddreal ddans = ddone - ddval;
std::cout << ddans.toDouble() << "\n"; // Gives 1.0e-42
ddreal ddalt{ 1.0, -1.0e-42 }; // Alternative initialization ...
ddreal ddsec = ddone - ddalt;
std::cout << ddsec.toDouble() << "\n"; // Gives 1.0e-42
return 0;
}
Note that I have deliberately neglected error-checking and other overheads that would be needed for a more general implementation. Also, the code I have shown has been 'tweaked' to work more optimally on x86/x64 CPUs, so you may need to delve into the code at the linked GitHub, if you need support for other platforms. (However, I think the code I have shown will work for any platform that conforms strictly to the IEEE-754 Standard.)
I have tested this implementation, extensively, in code I use to generate and display the Mandelbrot Set (and related fractals) at very deep zoom levels, where use of the raw double
type fails completely.
Note that, though you may be tempted to 'optimize' some of the seemingly pointless operations, doing so will break the system. Also, this must be compiled using the /fp:precise
(or /fp:strict
) flags (with MSVC), or the equivalent(s) for other compilers; using /fp:fast
will break the code, completely.