Round floating point number to smaller precision

Question

Is there a reasonably portable way to round a floating point number to a smaller precision, e.g. reduce the precision by 3 binary digits?

For example, an IEEE double has 53 bits of precision. I would like to round (or perhaps truncate) it to 50 bits of precision.

Is this possible with standard library facilities, without having to assume too much about the floating point representation (e.g. having to know the layout of bits)? It is unclear to me if the standard library has sufficient information about the representation (FLT_RADIX, DBL_MANT_DIG, etc.) to make this feasible.

Are you targeting specific representation? or general way? You probably can't do the later case. — Jarod42, Dec 09 '20 at 12:10
Answer from @nousie94 reposted as comment: This [Answer](https://stackoverflow.com/a/11101890/14793387) from the user schibum should help you, if you have c++11 or higher and if cmath is a standard library for you. — klutt, Dec 09 '20 at 12:16

chux - Reinstate Monica · Answer 1 · 2020-12-09T19:40:12.967

to round a floating point number to a smaller precision

Is this possible with standard library facilities, without having to assume too much about the floating point representation

If we assume the common FLT_RADIX == 2, easy enough to use frexp() to extract the significand, round (or truncate), and then reform.

#include <math.h>
#include <stdio.h>

double round3(double value) {
  // Get significand
  int exp;
  printf("%-22a %.20f\n", value, value);
  double normalized_fraction = frexp(value, &exp); // [0.5 ... 1.0) or zero,
  printf("%-22a %.20f\n", normalized_fraction, normalized_fraction);

  // Scale to a whole number range
  double normalized_integer = normalized_fraction * pow(FLT_RADIX, DBL_MANT_DIG);
  printf("%-22a %.20f %llX\n", normalized_integer, normalized_integer,
      (unsigned long long) normalized_integer);
  // .. but we really want it scaled by 3 bits less
  double normalized_integer_3 = normalized_fraction * pow(FLT_RADIX, DBL_MANT_DIG - 3);
  printf("%-22a %.20f\n", normalized_integer_3, normalized_integer_3);

  // round
  double rounded_normalized_integer_3 = round(normalized_integer_3);
  printf("%-22a %.20f\n", rounded_normalized_integer_3, rounded_normalized_integer_3);

  // reform
  double y = ldexp(rounded_normalized_integer_3, exp - (DBL_MANT_DIG - 3));
  printf("%-22a %.20f\n", y, y);
  puts("");
  return y;
}

Simplified

double round3(double value) {
  assert(FLT_RADIX == 2);
  int exp;
  double normalized_fraction = frexp(value, &exp); // [0.5 ... 1.0) or zero,
  double rounded_norm_integer_3 = round(normalized_fraction * pow(2, DBL_MANT_DIG-3));
  return ldexp(rounded_norm_integer_3, exp - (DBL_MANT_DIG - 3));
}

Test

int main() {
  round3(1.0/3.0);
}

Output

//              v --- last hex digit in binary 0101
0x1.5555555555555p-2   0.33333333333333331483
0x1.5555555555555p-1   0.66666666666666662966
0x1.5555555555555p+52  6004799503160661.00000000000000000000 15555555555555
0x1.5555555555555p+49  750599937895082.62500000000000000000
0x1.5555555555558p+49  750599937895083.00000000000000000000
0x1.5555555555558p-2   0.33333333333333348136
//              ^ --- last hex digit in binary 1000 (rounded)

To accommodate rare cases when FLT_RADIX != 2, other considerations are needed to maintain accuracy as frexp() and ldexp() work with powers-of-2.

For C++ users, sorry the output uses printf(), but that is only illustrative and can be removed.

There is a library function just for `normalized_fraction * pow(FLT_RADIX, DBL_MANT_DIG)`: `scalbn(normalized_fraction, DBL_MANT_DIG)`. However, this manual work with the floating-point components is unnecessary; the Veltkamp-Dekker split in the questions I marked as originals provides the desired rounding. — Eric Postpischil, Dec 09 '20 at 13:39
Just FYI, there's `std::format` to keep the iomanip nonsense at bay. — Passer By, Dec 09 '20 at 13:43
Re “… when `FLT_RADIX != 2`, other considerations…”: See `ilogb`. A `frexp` generalized to any base can be effected by `ilogb` followed by `scalbn` with the exponent negated. — Eric Postpischil, Dec 09 '20 at 13:46
@EricPostpischil The Veltkamp-Dekker split is subject to overflow. This only overflows, as it should, with values within a few steps of `DBL_MAX`. Perhaps I'll move this one there later. — chux - Reinstate Monica, Dec 09 '20 at 19:49

Round floating point number to smaller precision

1 Answers1