68

I am writing a program for embedded hardware that only supports 32-bit single-precision floating-point arithmetic. The algorithm I am implementing, however, requires 64-bit double-precision addition and comparison. I am trying to emulate the double datatype using a tuple of two floats, so a double d will be emulated as a struct containing the tuple (float d.hi, float d.low).

The comparison should be straightforward using a lexicographic ordering. The addition, however, is a bit tricky because I am not sure which base I should use. Should it be FLT_MAX? And how can I detect a carry?

How can this be done?


Edit (Clarity): I need the extra significant digits rather than the extra range.

phuclv
  • Do you need the extra *range* of `double`, or just the extra significant digits? – dan04 Jul 21 '11 at 00:01
  • I need the extra significant digits. –  Jul 21 '11 at 00:03
  • Then what is the range of values you have? It makes a big difference whether the exponents differ greatly or not. If they are all in a similar range, you can perhaps use fixed point math. Especially if you only need addition and comparison, fixed point may be the way to go. – Rudy Velthuis Jul 21 '11 at 00:17
  • Exponents differ greatly. I have almost three distinct intervals, around 1e-3, 1e+10, and 1e+20. –  Jul 21 '11 at 00:19
  • @M.S: That seems to be a huge range, even for `double` precision. Specifically, `1.0E+20` and `1.0E-03` differ by more than epsilon (for `double` this is typically `1.0E-16` or so) so I'd expect that operations like `1.0E+20 + 1.0E-03` would equate to `1.0E+20`, even when using `double`. Will that be an issue?? – Darren Engwirda Jul 21 '11 at 00:50
  • Yes, this kind of rounding will be an issue. I need to re-work parts of the algorithm to keep differences within epsilon. –  Jul 21 '11 at 01:08
  • Ah, these questions always bring out the best answers... – surfasb Jul 21 '11 at 03:04
  • @DarrenEngwirda but [double-double arithmetic](https://en.wikipedia.org/wiki/Quadruple-precision_floating-point_format#Double-double_arithmetic) doesn't work that way because the 2 elements can have different exponents and you can have a value like `1e30 + 1e-30` which is vastly larger than normal double significand. – phuclv Dec 01 '16 at 02:52

8 Answers

103

double-float is a technique that uses pairs of single-precision numbers to achieve almost twice the precision of single-precision arithmetic, at the cost of a slight reduction of the single-precision exponent range (due to intermediate underflow and overflow at the far ends of the range). The basic algorithms were developed by T.J. Dekker and William Kahan in the 1970s. Below I list two fairly recent papers that show how these techniques can be adapted to GPUs; however, much of the material covered in these papers is applicable independent of platform, so it should be useful for the task at hand.

https://hal.archives-ouvertes.fr/hal-00021443 Guillaume Da Graça and David Defour, Implementation of float-float operators on graphics hardware, 7th Conference on Real Numbers and Computers, RNC7.

http://andrewthall.org/papers/df64_qf128.pdf Andrew Thall, Extended-Precision Floating-Point Numbers for GPU Computation.
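
A minimal sketch in C of the core building blocks, based on the algorithms in those papers: Knuth's TwoSum recovers the exact rounding error of a float addition, and a Fast2Sum-style renormalisation keeps the pair in canonical form. This is the simplified ("sloppy") variant of the addition; the papers give the fully accurate version. The names `dfloat`, `two_sum`, `df_add`, and `df_cmp` are mine, and the code assumes strict IEEE 754 single-precision semantics (in particular, no `-ffast-math`, which would optimise the error terms away):

```c
typedef struct { float hi, lo; } dfloat;   /* value = hi + lo, |lo| <= 0.5 ulp(hi) */

/* Knuth's TwoSum: s = fl(a+b), e = the exact rounding error of that addition. */
static void two_sum(float a, float b, float *s, float *e)
{
    float t  = a + b;
    float bv = t - a;                /* the part of b that made it into t */
    float av = t - bv;               /* the part of a that made it into t */
    *s = t;
    *e = (a - av) + (b - bv);        /* what was rounded off, exactly     */
}

/* Simplified double-float addition: accurate except under heavy cancellation. */
static dfloat df_add(dfloat a, dfloat b)
{
    float s, e;
    two_sum(a.hi, b.hi, &s, &e);     /* add high parts, capture the error */
    e += a.lo + b.lo;                /* fold in both low parts            */
    /* Fast2Sum renormalisation (valid because |s| >= |e| here). */
    float hi = s + e;
    float lo = e - (hi - s);
    return (dfloat){ hi, lo };
}

/* Lexicographic comparison works because the pair is kept normalised. */
static int df_cmp(dfloat a, dfloat b)
{
    if (a.hi != b.hi) return (a.hi < b.hi) ? -1 : 1;
    if (a.lo != b.lo) return (a.lo < b.lo) ? -1 : 1;
    return 0;
}
```

Note that there is no explicit carry detection: the "carry" the question asks about is exactly the rounding error that TwoSum recovers, so the effective base is one ulp of the high part, not FLT_MAX.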

njuffa
  • +1 unlike the other answers this one actually addresses OP's question and gives very good links to the relevant papers. – R.. GitHub STOP HELPING ICE Jul 21 '11 at 06:31
  • Excellent! I was just looking for a way to make my `constexpr` functions more precise, and it only took half an hour to implement the functions in the first link. – Slava P Aug 16 '16 at 12:36
11

This is not going to be simple.

A float (IEEE 754 single-precision) has 1 sign bit, 8 exponent bits, and 23 bits of mantissa (well, effectively 24).

A double (IEEE 754 double-precision) has 1 sign bit, 11 exponent bits, and 52 bits of mantissa (effectively 53).

You can use the sign bit and 8 exponent bits from one of your floats, but how are you going to get 3 more exponent bits and 29 bits of mantissa out of the other?

Maybe somebody else can come up with something clever, but my answer is "this is impossible". (Or at least, "no easier than using a 64-bit struct and implementing your own operations")

Nemo
  • using the [float-float](https://en.wikipedia.org/wiki/Double-double_%28arithmetic%29#Double-double_arithmetic) technique he can't achieve double's range or precision, but it's significantly more than float and much faster than a software double when one has only hardware float arithmetic, such as on old CUDA GPUs or ARM CPUs – phuclv May 30 '14 at 12:49
9

It depends a bit on what types of operations you want to perform. If you only care about additions and subtractions, Kahan Summation can be a great solution.
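
For illustration, a minimal sketch of compensated summation in C (a hypothetical `kahan_sum` over an array). It relies on strict float semantics, so it must not be compiled with optimisations like `-ffast-math`, which would algebraically simplify the correction term to zero:

```c
#include <stddef.h>

/* Kahan summation: carries the low-order bits that a plain float sum
   would drop in a separate running compensation term c. */
static float kahan_sum(const float *x, size_t n)
{
    float sum = 0.0f;
    float c   = 0.0f;                 /* running compensation (lost low bits) */
    for (size_t i = 0; i < n; i++) {
        float y = x[i] - c;           /* apply the previous correction        */
        float t = sum + y;            /* big + small: low bits of y are lost  */
        c = (t - sum) - y;            /* recover exactly what was lost        */
        sum = t;
    }
    return sum;
}
```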

MaxSem
Mr Fooz
  • +1 interesting technique I hadn't heard of. Won't help though if his inputs need the extra precision to start with (not sure). – phkahler Jul 21 '11 at 01:21
  • @phkahler If an input to the summation is already float-float, just feed the two floats in separately. – Don Hatch Nov 19 '20 at 14:26
7

If you need both the precision and a wide range, you'll need a software implementation of double-precision floating point, such as SoftFloat.

(For addition, the basic principle is to break the representation (e.g. 64 bits) of each value into its three constituent parts: sign, exponent, and mantissa. Then shift the mantissa of one part based on the difference in the exponents, add to or subtract from the mantissa of the other part based on the sign bits, and possibly renormalise the result by shifting the mantissa and adjusting the exponent correspondingly. Along the way, there are a lot of fiddly details to account for, in order to avoid unnecessary loss of accuracy and to deal with special values such as infinities, NaNs, and denormalised numbers.)
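
To make the principle concrete, here is a deliberately stripped-down sketch in C that adds two positive, finite, normal doubles held as raw IEEE 754 bit patterns. It truncates instead of rounding correctly, and it ignores signs, NaNs, infinities, denormals, and exponent overflow, so treat it as an illustration of the alignment/renormalisation steps, not a usable implementation (the function name `soft_add` is mine):

```c
#include <stdint.h>

static uint64_t soft_add(uint64_t a, uint64_t b)
{
    /* For positive doubles, bit-pattern order equals numeric order, so
       after this swap a holds the value with the larger exponent. */
    if (b > a) { uint64_t t = a; a = b; b = t; }

    int ea = (int)((a >> 52) & 0x7FF);          /* biased exponents */
    int eb = (int)((b >> 52) & 0x7FF);

    /* Extract the mantissas and restore the implicit leading 1 bit. */
    uint64_t ma = (a & 0xFFFFFFFFFFFFFull) | (1ull << 52);
    uint64_t mb = (b & 0xFFFFFFFFFFFFFull) | (1ull << 52);

    /* Align: shift the smaller value's mantissa right by the exponent
       difference (the shifted-out bits are simply lost, i.e. truncation). */
    int shift = ea - eb;
    mb = (shift < 64) ? (mb >> shift) : 0;

    uint64_t m = ma + mb;
    int e = ea;

    /* Renormalise: a carry out of bit 52 means shift right and bump the
       exponent (this is the "carry" case). */
    if (m >> 53) { m >>= 1; e++; }

    return ((uint64_t)e << 52) | (m & 0xFFFFFFFFFFFFFull);
}
```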

Matthew Slattery
5

Given all the constraints for high precision over 23 magnitudes, I think the most fruitful method would be to implement a custom arithmetic package.

A quick survey shows that Briggs' doubledouble C++ library should address your needs and then some. See this.[*] The default implementation is based on double, achieving 30-significant-figure computation, but it is readily rewritten to use float, achieving 13 or 14 significant figures. That may be enough for your requirements if care is taken to segregate addition operations with similar magnitude values, only adding extremes together in the last operations.
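
One rough way to apply that segregation advice, as a sketch (the helper names are mine): sum in ascending order of magnitude, so values of similar size combine before the extremes are touched.

```c
#include <math.h>
#include <stdlib.h>

/* qsort comparator: order by ascending absolute value. */
static int cmp_by_magnitude(const void *p, const void *q)
{
    float a = fabsf(*(const float *)p);
    float b = fabsf(*(const float *)q);
    return (a > b) - (a < b);
}

/* Sum smallest-magnitude values first so similar sizes combine early. */
static float sum_ascending(float *x, size_t n)
{
    qsort(x, n, sizeof *x, cmp_by_magnitude);
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}
```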

Beware though, the comments mention messing around with the x87 control register. I didn't check into the details, but that might make the code too non-portable for your use.


[*] The C++ source is linked from that article; when I checked, only the gzipped tar link was not dead.

wallyk
3

This is similar to the double-double arithmetic used by many compilers to implement long double on machines that have hardware support only for double. It's also used as float-float on older NVIDIA GPUs with no double support. See Emulating FP64 with 2 FP32 on a GPU. Done this way, the calculation will be much faster than with a software floating-point library.

However, most microcontrollers have no hardware support for floats at all, so floats there are implemented purely in software. Because of that, using float-float may not increase performance, and it may introduce some memory overhead for the extra exponent bytes.

If you really need the longer mantissa, try using a custom floating-point library. You can choose whatever format is enough for you: for example, change the library to use a new 48-bit float type of your own if only 40 bits of mantissa and 7 bits of exponent are needed, so no time is spent calculating and storing the unnecessary 16 bits. Such a library needs to be very efficient, though, because compilers' built-in libraries often have assembly-level optimizations for their own float types.

phuclv
2

That's not practical. If it were, every embedded 32-bit processor (or compiler) would emulate double precision that way. As it stands, none that I am aware of do. Most of them just substitute float for double.

If you need the precision and not the dynamic range, your best bet would be to use fixed point. If the compiler supports 64-bit integers, this will be easier too.
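
For illustration, a minimal fixed-point sketch in C using a hypothetical Q40.24 layout (40 integer bits, 24 fractional bits in a signed 64-bit integer; the names and the split are mine, chosen only for the example). Addition and comparison become plain integer operations with well-defined carry behaviour:

```c
#include <stdint.h>

/* Hypothetical Q40.24 fixed point: value represented = raw / 2^24. */
typedef int64_t fix64;
#define FIX_FRAC_BITS 24

static fix64 fix_from_int(int32_t n)
{
    return (fix64)n * ((fix64)1 << FIX_FRAC_BITS);  /* scale up by 2^24  */
}

static fix64 fix_add(fix64 a, fix64 b)
{
    return a + b;                                   /* exact; wraps on overflow */
}

static int fix_cmp(fix64 a, fix64 b)
{
    return (a > b) - (a < b);                       /* -1, 0, or 1       */
}
```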

phuclv
phkahler
  • It's actually very practical in case one doesn't need the extra range of `double` and was widely used in [old NVIDIA CUDA GPUs](http://stackoverflow.com/q/29344800/995714) when they [didn't support `double`](http://andrewthall.org/papers/df64_qf128.pdf) or in some compilers for [near-quadruple-precision](https://en.wikipedia.org/wiki/Quadruple-precision_floating-point_format#Double-double_arithmetic) when hardware support is not available – phuclv Dec 01 '16 at 03:12
2

Another software-based solution that might be of use: GNU MPFR.
It takes care of many special cases and allows arbitrary precision (better than 64-bit double) that you would otherwise have to handle yourself.
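
A minimal usage sketch in C, assuming MPFR (and GMP, which it depends on) can be built for the target; the 113-bit precision here is arbitrary, pick what the algorithm needs:

```c
#include <stdio.h>
#include <mpfr.h>

int main(void)
{
    mpfr_t a, b, sum;
    mpfr_inits2(113, a, b, sum, (mpfr_ptr) 0);   /* ~34 significant digits */

    mpfr_set_d(a, 1.0e20, MPFR_RNDN);
    mpfr_set_d(b, 1.0e-3, MPFR_RNDN);
    mpfr_add(sum, a, b, MPFR_RNDN);              /* 1e-3 is not swallowed  */

    mpfr_printf("sum = %.40Rg\n", sum);
    printf("a < b? %d\n", mpfr_cmp(a, b) < 0);   /* comparison via mpfr_cmp */

    mpfr_clears(a, b, sum, (mpfr_ptr) 0);
    return 0;
}
```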

Ioan
  • In case the OP doesn't really need arbitrary precision (e.g. if 46-bit precision is enough), MPFR is way overkill. Especially since the OP already said this is on some **embedded hardware**, MPFR might be too big and too slow, or there might not even be enough memory to run it. The OP just needs *the extra significant digits rather than the extra range* – phuclv Feb 04 '23 at 16:37