
In C#, I want to round doubles to a lower precision so that I can store them in buckets of varying size in an associative array. Unlike the usual rounding, I want to round to a number of significant bits. Thus large numbers will change in absolute terms much more than small numbers, but they will tend to change the same proportionately. So if I want to round to 10 binary digits, I find the ten most significant bits, and zero out all the lower bits, possibly adding a small number for rounding up.

I prefer "half-way" numbers be rounded up.

If it were an integer type, here is a possible algorithm (a code sketch follows the list):

  1. Find: H, the zero-based index of the most significant bit set.
  2. Compute: B = H - P, 
       where P is the number of significant digits of precision to round to
       and B is the binary digit to start rounding at, where B = 0 is the ones place, 
       B = 1 is the twos place, etc. 
  3. Add: x = x + 2^B. 
       This forces a carry if necessary (so half-way values round up).
  4. Zero out: x = x - (x mod 2^(B+1)). 
       This clears the B place and all lower digits.
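
Here is that algorithm as C# code for integers (my sketch, not part of the original question; the linear scan in step 1 is a placeholder for the faster bit tricks discussed below):

public static ulong RoundToSignificantBits(ulong x, int p)
{
    if (x == 0)
        return 0;

    // Step 1: H = zero-based index of the most significant bit set (simple scan).
    int h = 63;
    while ((x & (1UL << h)) == 0)
        h--;

    // Step 2: B = H - P, the first bit position to be rounded away.
    int b = h - p;
    if (b < 0)
        return x; // x already fits in p significant bits.

    // Step 3: add 2^B, forcing a carry on half-way values (round half up).
    x += 1UL << b;

    // Step 4: clear the B place and all lower digits.
    return x & ~((1UL << (b + 1)) - 1);
}

For example, RoundToSignificantBits(13, 3) returns 14 (1101 => 1110).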

The problem is finding an efficient way to locate the highest bit set. If I were using integers, there would be cool bit hacks to find the MSB, but I do not want to call Round(Log2(x)) if I can help it. This function will be called many millions of times.
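
For concreteness, one such bit hack, sketched in C# (mine, not from the question): a binary search over bit positions, with no call to Log2:

static int HighestSetBit(ulong v) // assumes v != 0
{
    int h = 0;
    if ((v >> 32) != 0) { h += 32; v >>= 32; }
    if ((v >> 16) != 0) { h += 16; v >>= 16; }
    if ((v >> 8) != 0)  { h += 8;  v >>= 8;  }
    if ((v >> 4) != 0)  { h += 4;  v >>= 4;  }
    if ((v >> 2) != 0)  { h += 2;  v >>= 2;  }
    if ((v >> 1) != 0)  { h += 1; }
    return h;
}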

Note: I have read this SO question:

What is a good way to round double-precision values to a (somewhat) lower precision?

It works for C++. I am using C#.

UPDATE:

This is the code (modified from what the answerer supplied) as I am using it:

/// <summary>
/// Round numbers to a specified number of significant binary digits.
/// 
/// For example, to 3 places, numbers from zero to seven are unchanged, because they only require 3 binary digits,
/// but larger numbers lose precision:
/// 
///      8    1000 => 1000   8
///      9    1001 => 1010  10
///     10    1010 => 1010  10
///     11    1011 => 1100  12
///     12    1100 => 1100  12
///     13    1101 => 1110  14
///     14    1110 => 1110  14
///     15    1111 => 10000  16
///     16   10000 => 10000  16
///     
/// This is different from rounding in that we are specifying the place where rounding occurs as the distance to the right
/// in binary digits from the highest bit set, not the distance to the left from the zero bit.
/// </summary>
/// <param name="d">Number to be rounded.</param>
/// <param name="digits">Number of binary digits of precision to preserve. </param>
public static double AdjustPrecision(this double d, int digits)
{
    // TODO: Not sure if this will work for both normalized and denormalized doubles. Needs more research.
    var shift = 53 - digits; // IEEE 754 doubles have 53 bits of significand, but one bit is "implied" and not stored.
    ulong significandMask = (0xffffffffffffffffUL >> shift) << shift;
    var local_d = d;
    unsafe
    {
        // double -> fixed point (sorta)
        ulong toLong = *(ulong*)(&local_d);
        // mask off your least-sig bits
        var modLong = toLong & significandMask;
        // fixed point -> float (sorta)
        local_d = *(double*)(&modLong);
    }
    return local_d;
}
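
Note (my example, not part of the original post): because this version only masks bits off, it truncates rather than rounds. For instance, 19.0 is 10011 in binary, and keeping three significant bits by masking gives 10000:

double truncated = 19.0.AdjustPrecision(3); // 16.0; true rounding would give 20.0 (10100)

The next update switches to an approach that rounds to the nearest value.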

UPDATE 2: Dekker's Algorithm

I derived this from Dekker's algorithm, thanks to the other respondent. It rounds to the closest value, instead of truncating as the above code does, and it uses only safe code:

private static double[] PowersOfTwoPlusOne;

static NumericalAlgorithms()
{
    PowersOfTwoPlusOne = new double[54];
    for (var i = 0; i < PowersOfTwoPlusOne.Length; i++)
    {
        if (i == 0)
            PowersOfTwoPlusOne[i] = 1; // Special case: a multiplier of 1 leaves the value unchanged (digits == 53, nothing to round).
        else
        {
            long two_to_i_plus_one = (1L << i) + 1L;
            PowersOfTwoPlusOne[i] = (double)two_to_i_plus_one;
        }
    }
}

public static double AdjustPrecisionSafely(this double d, int digits)
{
    double t = d * PowersOfTwoPlusOne[53 - digits];
    double adjusted = t - (t - d);
    return adjusted;
}
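
For example (my illustration): 19.0 (10011 in binary) rounds up to the nearest 3-bit value, while the half-way value 13.0 (1101) rounds to even rather than up, as noted in the accepted answer below:

double up   = 19.0.AdjustPrecisionSafely(3); // 20.0 (10100)
double even = 13.0.AdjustPrecisionSafely(3); // 12.0 (1100), tie rounded to even, not 14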

UPDATE 3: TIMINGS

I ran a test and found that Dekker's algorithm is better than TWICE as fast!

Number of calls in test: 100,000,000
Unsafe Time = 1.922 (sec)
Safe Time = 0.799 (sec)
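
For reference, a minimal harness along the lines of the test described (my sketch; the input values and the accumulating sink variable are assumptions, there to keep the JIT from optimizing the calls away):

var sw = System.Diagnostics.Stopwatch.StartNew();
double sink = 0;
for (int i = 0; i < 100000000; i++)
    sink += (3.1415926 + i).AdjustPrecisionSafely(10);
sw.Stop();
Console.WriteLine("Safe Time = {0:F3} (sec)   sink = {1}", sw.Elapsed.TotalSeconds, sink);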

Paul Chernoch
  • After reading `What is a good way to round double-precision values to a (somewhat) lower precision?`, what have you come up with in terms of coding a solution? Why not try what you have read, and if it needs to be refactored or optimized, report back here with results and/or errors. Unless you already have code, in which case post it along with your original question. – MethodMan Jan 11 '13 at 19:49
  • You might need to be specific as to what the problem is in implementing the SO post you've linked to. – Ani Jan 11 '13 at 19:53
  • Of course. The other post is for C++, and calls the ldexp and frexp functions, for which I do not know an analog in C#. – Paul Chernoch Jan 11 '13 at 20:12
  • Aren't Scalb(y, n) and Logb(x) the equivalents of scalb and ilogb? If yes, they have the capabilities of frexp and ldexp with slightly different APIs... – aka.nice Jan 11 '13 at 20:35

2 Answers


Dekker’s algorithm will split a floating-point number into high and low parts. If there are s bits in the significand (53 in IEEE 754 64-bit binary), then *x0 receives the high s-b bits, which is what you requested, and *x1 receives the remaining bits, which you may discard. In the code below, Scale should have the value 2^b. If b is known at compile time, e.g., the constant 43, you can replace Scale with 0x1p43. Otherwise, you must produce 2^b in some way.

This requires round-to-nearest mode. IEEE 754 arithmetic suffices, but other reasonable arithmetic may be okay too. It rounds ties to even, which is not what you requested (ties upward). Is that necessary?

This assumes that x * (Scale + 1) does not overflow. The operations must be evaluated in double precision (not greater).

void Split(double *x0, double *x1, double x)
{
    double d = x * (Scale + 1); // = x*Scale + x, rounded; the rounding occurs at bit position b
    double t = d - x;           // about x*Scale; rounded again, since the low bits of x fall below the significand
    *x0 = d - t;                // x with its low b bits rounded away (this subtraction is exact)
    *x1 = x - *x0;              // the rounded-off remainder (also exact)
}
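
To see the mechanics on concrete values (my worked example, in C# to match the question), take x = 19 and keep 3 of the 53 bits, so b = 50 and Scale = 2^50:

double x = 19.0;                      // 10011 in binary
double scalePlusOne = (1L << 50) + 1; // Scale + 1, with Scale = 2^50
double d = x * scalePlusOne;          // the round-to-nearest happens in this product
double t = d - x;
double x0 = d - t;                    // 20.0: 10011 rounded to three bits is 10100
double x1 = x - x0;                   // -1.0: the signed part that was rounded away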
Eric Postpischil
  • Rounding to even will be OK if everything else works. I will look at Dekker's algorithm as you suggest. – Paul Chernoch Jan 11 '13 at 20:09
  • Your note led me to a paper called "Accurate Floating-Point Summation" by Rump, Ogita, and Oishi. Pure magic! – Paul Chernoch Jan 11 '13 at 20:35
  • I may eventually go with this approach, once I understand how it works! – Paul Chernoch Jan 11 '13 at 23:47
  • It works thusly: First, we want `x*Scale`, because this has a high bit where we want it to make a later rounding happen at the right spot. But we actually want a slightly greater value that allows us to subtract x and still have that same high bit. So we calculate `x*Scale+x`, which is `x*(Scale+1)`. That is `d`. Next, observe that `d-(d-x)` would be exactly `x` if computed exactly. However, `d-x` is performed in floating point, so it is rounded. `d` is greater than `x`, so the low bits of `x` are below the significand. So `d-(d-x)` gives us `x` with those low bits rounded away. – Eric Postpischil Jan 12 '13 at 00:40
  • I tested this algorithm, and it rounds to the closest value, which is better than the unsafe code I used. I am about to test the speed. – Paul Chernoch Jan 23 '13 at 13:14
  • Your method (Dekker's algorithm) is better than twice as fast as the unsafe code. Great answer. – Paul Chernoch Jan 23 '13 at 13:41
  • I am not aware of Dekker's algorithm. Do you have a source that discusses it? – Z boson Apr 28 '15 at 12:48
  • The problem is that there is another Dekker's algorithm: https://en.wikipedia.org/wiki/Dekker%27s_algorithm – Z boson Apr 28 '15 at 12:51
  • 1) `double d = x * (Scale + 1);` needs additional code to handle potential overflow (alluded to in the answer). 2) `double d = x * (Scale + 1);` may incur a rounding error which may be important on edge cases. – chux - Reinstate Monica Dec 20 '17 at 00:07
  • @chux: `x * (Scale + 1)` is intended to round, with the subsequent operations separating out the change caused by that rounding. Do you perceive a case where this is incorrect? Can you show an example with specific values? – Eric Postpischil Dec 20 '17 at 00:30
  • @EricPostpischil I failed to find a failing case, aside from overflow issues. Excuse me while I have a slice of [humble pie](https://en.wikipedia.org/wiki/Humble_pie) – chux - Reinstate Monica Dec 20 '17 at 02:40
  • I implemented this algorithm in Java and it failed on one pathological case I've tried, namely 1.026606072791026E308, which has a binary representation of `0 11111111110 0010010001100011000110011111011100100100111000111011`, with `b = 29`. I'll see if I can reproduce the problem in, and port my solution to, .NET. – Kevin Jin Feb 11 '18 at 02:56
  • @KevinJin: As stated in the answer, the code shown assumes that `x * (Scale + 1)` does not overflow. To handle cases close to the upper limit of the format, you can simply scale down, *e.g.*, by multiplying by 2**-100, apply the algorithm, and scale back up, *e.g.*, by multiplying by 2**100. Note that the correct result may still be infinity, if the number was such that rounding it up at the selected position causes it to overflow the finite range. – Eric Postpischil Feb 11 '18 at 03:08
  • I see. There probably is some constant threshold at which the number first needs to be scaled down and then scaled back up. I'll see if I can find that threshold so that I can add a conditional branch that will make this solution work for the entire representable range. – Kevin Jin Feb 11 '18 at 03:12

Interesting...never heard of a need for this, but I think you can "do it" via some funky unsafe code...

void Main()
{
    // how many bits you want "saved"
    var maxBits = 20;

    // create a mask like 1111...000 in binary, where # of leading 1's == maxBits
    var shift = (sizeof(int) * 8) - maxBits;
    var maxBitsMask = (0xffffffff >> shift) << shift;

    // some floats
    var floats = new []{ 1.04125f, 2.19412347f, 3.1415926f};
    foreach (var f in floats)
    {
        var localf = f;
        unsafe
        {
            // float -> fixed point (sorta)
            int toInt = *(int*)(&localf);
            // mask off your least-sig bits; cast the mask so the result stays 32 bits wide
            var modInt = toInt & (int)maxBitsMask;
            // fixed point -> float (sorta)
            localf = *(float*)(&modInt);
        }
        Console.WriteLine("Was {0}, now {1}", f, localf);
    }
}

And with doubles:

void Main()
{
    var maxBits = 50;
    var shift = (sizeof(long) * 8) - maxBits;
    var maxBitsMask = (0xffffffffffffffff >> shift) << shift;
    var doubles = new []{ 1412.04125, 22.19412347, 3.1415926};
    foreach (var d in doubles)
    {
        var local = d;
        unsafe
        {
            var toLong = *(ulong*)(&local);
            var modLong = toLong & maxBitsMask;
            local = *(double*)(&modLong);
        }
        Console.WriteLine("Was {0}, now {1}", d, local);
    }
}

Aww...I got unaccepted. :)

For completeness, here's using Jeppe's "unsafe-free" approach:

void Main()
{
    var maxBits = 50;
    var shift = (sizeof(long) * 8) - maxBits;
    var maxBitsMask = (long)((0xffffffffffffffff >> shift) << shift);
    var doubles = new []{ 1412.04125, 22.19412347, 3.1415926};
    foreach (var d in doubles)
    {
        var asLong = BitConverter.DoubleToInt64Bits(d);
        var modLong = asLong & maxBitsMask;
        var local = BitConverter.Int64BitsToDouble(modLong);
        Console.WriteLine("Was {0}, now {1}", d, local);
    }
}
JerKimball
  • Intriguing. You use floats. Does anything have to change for doubles? And does this work for both normalized and unnormalized doubles? – Paul Chernoch Jan 11 '13 at 22:02
  • @PaulChernoch Oh, as for normalized/unnormalized, not a bloody clue. :) – JerKimball Jan 11 '13 at 22:55
  • I figured out the double thing. It works! I do not know about normalized/unnormalized yet, but until I can do more research, I will use this. Thanks! (BTW: I use shift = 53 - digits, since IEEE 754 has 53 bits of mantissa, even if one is implied.) – Paul Chernoch Jan 11 '13 at 23:46
  • @PaulChernoch No problem - it *should* be relatively fast, too - I mean, it's all just bit-shifts and casts in theory... – JerKimball Jan 12 '13 at 00:17
  • @JerKimball: Unfortunately, it is not just bit shifts and casts. It is also a transfer of the data from wherever the processor keeps floating-point objects to where the processor keeps integer objects. In many modern processors, those are separate registers, and the transfer requires either a special move instruction or a round-trip through memory. That is fine for an occasional operation, but it can be a problem when it must be performed many times in a high-performance application. That is why a solution using only floating-point operations may be appealing. – Eric Postpischil Jan 12 '13 at 00:28
  • @EricPostpischil You are, of course, correct - I wonder how apt it would be to use the GPU for something like this...overkill? – JerKimball Jan 12 '13 at 00:30
  • @JerKimball: No, not overkill, provided the data is already in the GPU. There are various applications that need to process a lot of data quickly, and doing things like this is worth it. It still surprises me that, although computers are so fast today, people still have massive amounts of data they want to process even faster, even in consumer applications. – Eric Postpischil Jan 12 '13 at 00:42
  • Instead of using `unsafe` code, you can use the static methods `BitConverter.DoubleToInt64Bits` and `BitConverter.Int64BitsToDouble` to convert between `double` (`System.Double`) and `long` (`System.Int64`). You can then do the bit-wise AND on the `long` value in "safe" context, of course. – Jeppe Stig Nielsen Jan 15 '13 at 12:26
  • @jeppe-stig-nielsen good call. I'd be interested as to how they compare performance-wise, given the original question – JerKimball Jan 15 '13 at 15:09
  • As always, if you want to know about performance, measure for yourself. I wouldn't be surprised if `DoubleToInt64Bits` were implemented almost like your code above. Actually, I just found the source: `public static unsafe long DoubleToInt64Bits(double value) { return *((long *)&value); }` – Jeppe Stig Nielsen Jan 15 '13 at 15:29
  • @JeppeStigNielsen Hah - fair enough; only idly commenting since I wasn't in front of a computer. :) – JerKimball Jan 15 '13 at 15:31
  • When I have time, I will investigate using DoubleToInt64Bits or BitConverter. The implementation I show above performs very fast, but my boss prefers I not use unsafe code. (In some of my tests, I call it 32 million times when performing a convolution over a single insurance policy. Typically, the application will process hundreds of thousands of policies. So you can see, it needs to be REALLY fast.) – Paul Chernoch Jan 23 '13 at 12:15
  • @PaulChernoch @JeppeStigNielsen Added the `BitConverter` route example for completeness (momentary twinge of sadness as I got 'unaccepted' due to other way being faster;) ) – JerKimball Jan 23 '13 at 16:14