The use case:

I have some large data arrays containing floating point constants. The file defining those arrays is generated, and the template can easily be adapted.

I would like to run some tests on how reduced precision influences the results, both in terms of quality and in terms of the compressibility of the binary.

Since I do not want to change any source code other than the generated file, I am looking for a way to reduce the precision of the constants.

I would like to limit the mantissa to a fixed number of bits (setting the lower ones to 0). But since floating point literals are written in decimal, it is difficult to specify numbers in such a way that their binary representation contains all zeros in the lower mantissa bits.

The best case would be something like:

#define FP_REDUCE(float)  /* some macro  */

static const float32_t veryLargeArray[] = {
  FP_REDUCE(23.423f), FP_REDUCE(0.000023f), FP_REDUCE(290.2342f),
  // ... 
};

#undef FP_REDUCE

This should be done at compile time and it should be platform independent.

vlad_tepesch
  • I'm not clear how you think this is going to save space. If the array is still `float32_t`, then it'll occupy the same space whether the bits in the mantissa are 0 or 1. How are you going to handle the implied 1 bit at the start of the mantissa? Do you just need to set all the mantissa bits to zero? Are you then going to simply store the sign plus exponent info? Is that 8 bits or 9 bits? If it's 9 bits as I seem to remember (23 bits for mantissa rings bells), then you've lost half the gain compared to 8 bits. I don't know why you're so starved for space that this idea makes sense. – Jonathan Leffler Dec 03 '18 at 16:13
  • See Wikipedia on [IEEE 754](https://en.wikipedia.org/wiki/IEEE_754). There are 23 mantissa bits, 8 exponent bits, and 1 sign bit (plus the implicit one-bit). Maybe you don't need all 8 bits of the exponent range and can use the sign and 7 bits of exponent; or maybe all your numbers are positive and you can do without the sign bit. But sign plus exponent is 9 bits in a 32-bit floating point number. – Jonathan Leffler Dec 03 '18 at 16:16
  • @JonathanLeffler: I don't see where OP indicated that it's a size optimization, although that could be achieved with bit packing. It might equally be a prototype to test if using a smaller floating point type in some other component (e.g. a neural net model) would be acceptable. – R.. GitHub STOP HELPING ICE Dec 03 '18 at 16:31
  • @R.. "compressibility" is to some extent about size. – Jonathan Leffler Dec 03 '18 at 16:33
  • @JonathanLeffler: Yes, but it could even be a matter of shipping a smaller binary as part of a compressed archive or storing it on a compressed fs, in which case fewer significant bits in a giant static table would reduce (compressed) size. – R.. GitHub STOP HELPING ICE Dec 03 '18 at 16:37
  • Re “since floating point literals are in decimal”: This is not necessarily so; C has hexadecimal floating-point literals. You could rewrite the literals in your source code as hexadecimal. But, of course, you could also write a preprocessing program that converts floating-point literals in source code to whatever values you like, thus completely avoiding any need for C preprocessing, constant-expression, or compound-literal-union shenanigans. (You say you do not want to change the source, but the preprocessor program could run at build time, taking the original source as input.) – Eric Postpischil Dec 03 '18 at 17:13

2 Answers

What you're asking for can be done with varying degrees of partial portability, but not absolute unless you want to run the source file through your own preprocessing tool at build time to reduce the precision. If that's an option for you, it's probably your best one.

Short of that, I'm going to assume at least that your floating point types are base 2 and obey Annex F/IEEE semantics. This should be a reasonable assumption, but the latter part is false with gcc on platforms with extended precision (including 32-bit x86) under the default standards-conformance profile; you need -std=cNN or -fexcess-precision=standard to fix it.

One approach is to add and subtract a power of two chosen to cause rounding to the desired precision:

#define FP_REDUCE(x,p) ((x)+(p)-(p))

Unfortunately, this works in terms of absolute precision, not relative precision, and requires knowing the right value of p for the particular x: it is equal to the value of the leading base-2 place of x, times 2 raised to the power of FLT_MANT_DIG minus the number of bits of precision you want. This cannot be evaluated as a constant expression for use as an initializer, but you can write it in terms of FLT_EPSILON and, if you can assume C99+, preprocessor token pasting to form a hex float literal, yielding the correct value for this factor. But you still need to know the power of two of the leading digit of x, and I don't see any way to extract that as a constant expression.

Edit: I believe this is fixable, so that it does not need an absolute precision but instead scales automatically with the value, but it depends on the correctness of a work in progress. See Is there a correct constant-expression, in terms of a float, for its msb?. If that works, I will integrate the result into this answer.

Another approach I like, if your compiler supports compound literals in static initializers and if you can assume IEEE type representations, is using a union and masking off bits:

union fr { float x; uint32_t r; };
#define FP_REDUCE(x) ((union fr){.r=(union fr){x}.r & (0xffffffffu<<n)}.x)

where n is the number of bits you want to drop. This will round towards zero rather than to nearest; if you want to make it round to nearest, it should be possible by adding an appropriate constant to the low bits before masking, but you have to take care about what happens when the addition overflows into the exponent bits.

R.. GitHub STOP HELPING ICE

The following uses the Veltkamp-Dekker splitting algorithm to remove n bits (with rounding) from x, where p = 2^n (for example, to remove eight bits, use 0x1p8f for the second argument). The casts to float32_t coerce the results to that type, as the C standard otherwise permits implementations to use more precision within expressions. (Double-rounding could produce incorrect results in theory, but this will not occur when float32_t is the IEEE basic 32-bit binary format and the C implementation computes this expression in that format or the 64-bit format or wider, as the former is the desired format and the latter is wide enough to represent intermediate results exactly.)

IEEE-754 binary floating-point is assumed, with round-to-nearest. Overflow occurs if x•(p+1) rounds to infinity.

#define RemoveBits(x, p) (float32_t) (((float32_t) ((x) * ((p)+1))) - (float32_t) (((float32_t) ((x) * ((p)+1))) - (x)))
Eric Postpischil
  • This rounds to a requested absolute precision, not a particular number of significand bits. But see my answer and the linked question -- I think it's fixable. – R.. GitHub STOP HELPING ICE Dec 03 '18 at 17:29
  • @R..: No, it does not. It removes a number of bits of the significand, as stated. It is relative, not absolute. `printf("%a\n", RemoveBits(1.f/3, 0x1p8));` `printf("%a\n", RemoveBits(65536.f/3, 0x1p8));` prints “0x1.5556p-2” and “0x1.5556p+14”. – Eric Postpischil Dec 03 '18 at 17:49
  • This seems to work, but sadly for at least one compiler the binary size increases. At first glance it seems the compiler does not evaluate the expression at compile time but generates initialization code. It looks like the most portable solution is the theoretically least portable one, generating `uint32_t` arrays and casting them to `float32_t`, because hex floating-point literals are not supported either. – vlad_tepesch Dec 04 '18 at 10:20
  • @vlad_tepesch: Sounds like you have a really broken compiler. It should not be possible for it to generate initializing code; that's not even a feature C has. – R.. GitHub STOP HELPING ICE Dec 04 '18 at 13:51
  • @R.. I compiled the source as `C++` because even using the macro in the array initialization made it complain about non-constant initializers. – vlad_tepesch Dec 04 '18 at 15:28
  • @vlad_tepesch: You must have done something wrong if that happened, because the expression here is a constant expression. Or it's just a buggy compiler that doesn't actually support C... – R.. GitHub STOP HELPING ICE Dec 04 '18 at 18:35
  • I tried it with VS2010 and VS2015. Neither supports C99. – vlad_tepesch Dec 04 '18 at 19:50
  • @R.. I took a deeper look: the resulting error C2099 seems to be specific to the `fp:strict` compiler setting. With it set to `precise`, it works as expected. – vlad_tepesch Dec 05 '18 at 08:00