In C#, I want to round doubles to a lower precision so that I can store them in buckets of varying size in an associative array. Unlike the usual rounding, I want to round to a number of significant bits. Thus large numbers will change in absolute terms much more than small numbers, but they will tend to change the same proportionately. So if I want to round to 10 binary digits, I find the ten most significant bits, and zero out all the lower bits, possibly adding a small number for rounding up.
I prefer "half-way" numbers be rounded up.
If it were an integer type, here would be a possible algorithm:
1. Find: zero-based index of the most significant binary digit set H. 2. Compute: B = H - P, where P is the number of significant digits of precision to round and B is the binary digit to start rounding, where B = 0 is the ones place, B = 1 is the twos place, etc. 3. Add: x = x + 2^B This will force a carry if necessary (we round halfway values up). 4. Zero out: x = x mod 2^(B+1). This clears the B place and all lower digits.
The problem is finding an efficient way to find the highest bit set. If I were using integers, there are cool bit hacks to find the MSB. I do not want to call Round(Log2(x)) if I can help it. This function will be called many millions of times.
Note: I have read this SO question:
What is a good way to round double-precision values to a (somewhat) lower precision?
It works for C++. I am using C#.
UPDATE:
This is the code (modified from what the answerer supplied) as I am using it:
/// <summary>
/// Round numbers to a specified number of significant binary digits.
///
/// For example, to 3 places, numbers from zero to seven are unchanged, because they only require 3 binary digits,
/// but larger numbers lose precision:
///
/// 8 1000 => 1000 8
/// 9 1001 => 1010 10
/// 10 1010 => 1010 10
/// 11 1011 => 1100 12
/// 12 1100 => 1100 12
/// 13 1101 => 1110 14
/// 14 1110 => 1110 14
/// 15 1111 =>10000 16
/// 16 10000 =>10000 16
///
/// This is different from rounding in that we are specifying the place where rounding occurs as the distance to the right
/// in binary digits from the highest bit set, not the distance to the left from the zero bit.
/// </summary>
/// <param name="d">Number to be rounded.</param>
/// <param name="digits">Number of binary digits of precision to preserve. </param>
public static double AdjustPrecision(this double d, int digits)
{
// TODO: Not sure if this will work for both normalized and denormalized doubles. Needs more research.
var shift = 53 - digits; // IEEE 754 doubles have 53 bits of significand, but one bit is "implied" and not stored.
ulong significandMask = (0xffffffffffffffffUL >> shift) << shift;
var local_d = d;
unsafe
{
// double -> fixed point (sorta)
ulong toLong = *(ulong*)(&local_d);
// mask off your least-sig bits
var modLong = toLong & significandMask;
// fixed point -> float (sorta)
local_d = *(double*)(&modLong);
}
return local_d;
}
UPDATE 2: Dekker's Algorithm
I derived this from Dekker's algorithm, thanks to the other respondent. It rounds to the closest value, instead of truncating as the above code does, and it uses only safe code:
private static double[] PowersOfTwoPlusOne;
static NumericalAlgorithms()
{
PowersOfTwoPlusOne = new double[54];
for (var i = 0; i < PowersOfTwoPlusOne.Length; i++)
{
if (i == 0)
PowersOfTwoPlusOne[i] = 1; // Special case.
else
{
long two_to_i_plus_one = (1L << i) + 1L;
PowersOfTwoPlusOne[i] = (double)two_to_i_plus_one;
}
}
}
public static double AdjustPrecisionSafely(this double d, int digits)
{
double t = d * PowersOfTwoPlusOne[53 - digits];
double adjusted = t - (t - d);
return adjusted;
}
UPDATE 2: TIMINGS
I ran a test and found that Dekker's algorithm is better than TWICE as fast!
Number of calls in test: 100,000,000
Unsafe Time = 1.922 (sec)
Safe Time = 0.799 (sec)