I'm trying to reduce the precision of double variables in C to test the effect on the results. I tried doing a bitwise &, but it gives an error.
How can I do this on float and double variables?
How to reduce the precision of a double in C?
To reduce the relative precision of a floating-point number so that some of the least significant bits of the significand (mantissa) are zeroed, code needs to access the significand.
Use frexp() to extract the significand and exponent of the FP number.
Scale the significand with ldexp() and then round, truncate, or floor, depending on coding goals, to remove precision. Truncation is shown, yet I recommend rounding via rint().
Scale back and add back the exponent.
#include <math.h>
#include <stdio.h>

double reduce(double x, int precision_power_2) {
    if (isfinite(x)) {
        int power_2;
        // The frexp functions break a floating-point number into a
        // normalized fraction and an integral power of 2.
        double normalized_fraction = frexp(x, &power_2); // 0.5 <= |result| < 1.0 or 0
        // The ldexp functions multiply a floating-point number by an integral power of 2
        double less_precise = trunc(ldexp(normalized_fraction, precision_power_2));
        x = ldexp(less_precise, power_2 - precision_power_2);
    }
    return x;
}

void testr(double x, int pow2) {
    printf("reduce(%a, %d --> %a\n", x, pow2, reduce(x, pow2));
}

int main(void) {
    testr(0.1, 5);
    return 0;
}
Output
//       v-53 bin.digs-v             v-v 5 significant binary digits
reduce(0x1.999999999999ap-4, 5 --> 0x1.9p-4
Use frexpf(), ldexpf(), rintf(), truncf(), floorf(), etc. for float.
If you wish to apply the bitwise AND (&), you need to apply it to the integer representation of the float value:
#include <stdio.h>
#include <string.h> // memcpy

int main(void) {
    float f = 0.1f;
    printf("Before: %a %.16e\n", f, f);
    unsigned int i;
    _Static_assert(sizeof f == sizeof i, "pick integer type of the correct size");
    memcpy(&i, &f, sizeof i);
    i &= ~0x3U; // clears the two lowest bits; or use any other mask.
    // This one assumes the endianness of floats is identical to integers'
    memcpy(&f, &i, sizeof f);
    printf("After: %a %.16e\n", f, f);
    return 0;
}
Note that this does not provide you with 30-bit IEEE-754-like numbers (the mask above clears the two lowest significand bits). The value in f was first rounded to a 32-bit single-precision number, and then brutally truncated.
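Since the question also asks about double, here is a minimal sketch of the same masking idea with a 64-bit integer. The helper name mask_low_bits and its drop parameter are hypothetical; the code assumes double is IEEE 754 binary64 with the same byte order as uint64_t, and it does not special-case NaN or infinity.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(double) == sizeof(uint64_t), "double must be 64 bits wide");

/* Hypothetical helper: clear the `drop` lowest significand bits of x,
   which truncates toward zero. Assumes IEEE 754 binary64 and matching
   endianness between double and uint64_t. */
double mask_low_bits(double x, int drop) {
    assert(drop >= 0 && drop <= 52);       /* 52 stored significand bits */
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits &= ~((UINT64_C(1) << drop) - 1);  /* zero the low `drop` bits */
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

Dropping 48 bits from 0.1 leaves 4 stored bits plus the implicit leading one, i.e. 5 significant binary digits, which gives 0x1.9p-4.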
A more elegant method relies on a floating-point constant with two bits set:
#include <stdio.h>

int main(void) {
    float f = 0.1f;
    float factor = 5.0f; // or 3, or 9, or 17
    float c = factor * f;
    f = c - (c - f);
    printf("After: %a %.16e\n", f, f);
    return 0;
}
The advantage of this method is that it rounds f to the nearest value using N bits of significand (a factor of 2^s + 1 leaves N = 24 - s bits for float, so 5.0f gives 22 bits), as opposed to truncating it towards zero as in the first method. However, the program is still computing with 32-bit IEEE 754 floating-point and then rounding to fewer bits, so the result is still not always equivalent to what a narrower floating-point type would have produced.
The second method relies on an idea by Dekker, described online in this article.
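The same trick should carry over to double, where a factor of 2^s + 1 leaves 53 - s significand bits. The sketch below is only an illustration: round_off_bits is a made-up name, the range check on s is deliberately conservative, and the code assumes IEEE 754 binary64 arithmetic evaluated in double precision (FLT_EVAL_METHOD == 0) with no overflow in the multiplication.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical helper: round x to (53 - s) significand bits using the
   factor 2^s + 1. Assumes binary64 arithmetic without excess precision
   and no overflow in factor * x. */
double round_off_bits(double x, int s) {
    assert(s >= 2 && s <= 51);             /* conservative range, an assumption */
    double factor = ldexp(1.0, s) + 1.0;   /* 2^s + 1: two bits set */
    /* volatile is meant to discourage the compiler from fusing these
       steps or keeping intermediates in a wider format. */
    volatile double c = factor * x;
    volatile double d = c - x;
    return c - d;
}
```

For example, round_off_bits(0.1, 48) keeps 5 bits and rounds to nearest, giving 26/256, whereas truncation to 5 bits gives 25/256.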