
Since the spacing between representable float values grows for larger magnitudes, in some cases it may be useful to quantize a value based on its size, instead of quantizing by an absolute step.

A naive approach could be to detect the precision and scale it up:

#include <float.h> /* FLT_MAX */
#include <math.h>  /* fabsf, floorf, nextafterf */

float quantize(float value, float quantize_scale) {
    /* spacing between |value| and the next float, scaled to the step size */
    float factor = (nextafterf(fabsf(value), FLT_MAX) - fabsf(value)) * quantize_scale;
    return floorf((value / factor) + 0.5f) * factor;
}

However, this seems too heavy.

Instead, it should be possible to mask out bits in the float's mantissa to simulate something like casting to a 16-bit float and back, for example.

Not being an expert in float bit-twiddling, I couldn't say whether the resulting float would be valid (or would need normalizing).
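Something like this sketch is what I have in mind (assuming IEEE 754 single precision and 0 <= drop_bits <= 23; `quantize_mask` is just an illustrative name). Note that masking truncates toward zero rather than rounding:

#include <stdint.h>
#include <string.h>

float quantize_mask(float value, int drop_bits) {
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);          /* type-pun without UB */
    bits &= ~((UINT32_C(1) << drop_bits) - 1u);  /* clear low mantissa bits */
    memcpy(&value, &bits, sizeof value);
    return value;
}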


For speed, when exact rounding behavior isn't important, what is a fast way to quantize floats, taking their magnitude into account?

ideasman42
  • What do you mean by “quantize” a float? And why would you want to? Discarding information from a computation is usually not a good thing to do. – Eric Postpischil Feb 09 '18 at 04:36
  • Minor: `floorf((value / factor) + 0.5f) * factor;` is an issue with `value < 0`. The `floorf()` and `+` are the wrong direction. – chux - Reinstate Monica Feb 09 '18 at 04:37
  • 2
    If you do need to separate a floating-point value’s high bits, the way to do it is [the Veltkamp-Dekker split](https://stackoverflow.com/questions/14285492/efficient-way-to-round-double-precision-numbers-to-a-lower-precision-given-in-nu/14285800#14285800). – Eric Postpischil Feb 09 '18 at 04:38
  • For speed: perhaps research `frexpf()` and then scale, round, scale back the normalized fraction. – chux - Reinstate Monica Feb 09 '18 at 04:38
  • Likely duplicate of [Efficient way to round double precision numbers to a lower precision given in number of bits](https://stackoverflow.com/questions/14285492/efficient-way-to-round-double-precision-numbers-to-a-lower-precision-given-in-nu/14285800#14285800). Given the description of “mask out bits” in the significand and “exact behavior regarding rounding isn’t important,” this is satisfied by the Veltkamp-Dekker split. – Eric Postpischil Feb 09 '18 at 04:41
  • @eric-postpischil Some algorithms require quantized input - e.g. the Bentley-Ottmann algorithm. The Cork Boolean library does this too. – ideasman42 Feb 09 '18 at 04:41
  • @eric-postpischil - I don't think it's a duplicate, although it's close. This question is not about C# or double-precision floats. – ideasman42 Feb 09 '18 at 04:45
  • @ideasman42: C# and the precision are irrelevant. Veltkamp-Dekker is general algorithm; the language it is coded in is irrelevant, except one must ensure that proper IEEE 754 operations are used. (E.g., in C, write `double t0 = d-x, t1 = d - t0;` rather than `double t1 = d - (d-x);`, because the compiler is allowed to use extra precision with the latter.) And for `float` or other precisions, you merely adjust the multiplier in the algorithm. – Eric Postpischil Feb 09 '18 at 04:47
  • @eric-postpischil: languages *can* be relevant - some answers use builtin functions, for example. But I agree it's not necessarily relevant. The other question specifically asks about reducing bits. This question is more general. – ideasman42 Feb 09 '18 at 04:56
  • Float **precision** is not reduced for large values (it still has the same number of bits in the significand), it is the same. But the gaps between values can be larger than one. – Rudy Velthuis Feb 09 '18 at 10:09
  • Right, the issue is that quantizing by a fixed step leaves fewer representable float values between quantization steps as the magnitude increases. – ideasman42 Feb 09 '18 at 10:35
  • I asked a [similar question for Java and posted an answer](https://stackoverflow.com/q/48727424/444402) about downcasting to a float and upcasting back to a double. If you want performance, you probably should quantize values in intervals of some power of 2. I believe you should be able to substitute `Math.getExponent()`, `Math.scalb()`, and `Math.rint()` with `frexp()`, `ldexp()`, and `rint()` to reproduce my result for the normal range, but I'm not sure how `frexp()` handles subnormals. This may be heavier than necessary since I wanted round-to-nearest-even but you don't care about rounding. – Kevin Jin Feb 11 '18 at 14:31
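A minimal sketch of the `frexpf()` approach suggested by chux and Kevin Jin above, assuming C99 `<math.h>` (`quantize_frexp` is an illustrative name; it keeps a fixed number of fraction bits and, in the default rounding mode, `rintf()` rounds ties to even):

#include <math.h>

float quantize_frexp(float value, int bits) {
    int exp;
    float frac = frexpf(value, &exp); /* value = frac * 2^exp, |frac| in [0.5, 1) */
    frac = rintf(ldexpf(frac, bits)); /* round in the scaled domain */
    return ldexpf(frac, exp - bits);  /* undo the scaling */
}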

1 Answer


The Veltkamp-Dekker splitting algorithm will split a floating-point number into high and low parts. Sample code is below.

If there are s bits in the significand (53 in IEEE 754 64-bit binary), and the value `Scale` in the code below is 2^b, then `*x0` receives the high s-b bits of x, and `*x1` receives the remaining bits, which you may discard (or remove from the code below, so it is never calculated). If b is known at compile time, e.g., the constant 43, you can replace `Scale` with the appropriate constant, such as `0x1p43`. Otherwise, you must produce 2^b in some way.
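If b is only known at run time, one way (a sketch, not from the answer) to produce 2^b is `ldexp`, which is exact for b within the exponent range:

#include <math.h>

/* Produce Scale = 2^b at run time; `make_scale` is an illustrative name. */
double make_scale(int b) { return ldexp(1.0, b); }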

This requires round-to-nearest mode. IEEE 754 arithmetic suffices, but other reasonable arithmetic may be okay too. It rounds ties to even.

This assumes that `x * (Scale + 1)` does not overflow. The operations must be evaluated in the same precision as the value being separated (`double` for `double`, `float` for `float`, and so on). If the compiler evaluates `float` expressions with `double`, this would break; a workaround is to convert the inputs to the widest supported floating-point type, perform the split in that type (with `Scale` adjusted correspondingly), and then convert back.

void Split(double *x0, double *x1, double x)
{
    double d = x * (Scale + 1); /* Scale is 2^b, e.g., 0x1p43 for b = 43 */
    double t = d - x;
    *x0 = d - t;                /* high s-b bits of x, rounded to nearest */
    *x1 = x - *x0;              /* remaining low bits, exactly x - *x0 */
}
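Since the question asks about `float`, here is a sketch of the workaround described above: widen to `double`, split there with `Scale` adjusted so the high part keeps k significand bits, then narrow back (`SplitFloatViaDouble` and the parameter `k` are illustrative, not from the answer):

#include <math.h>

void SplitFloatViaDouble(float *x0, float *x1, float x, int k)
{
    double Scale = ldexp(1.0, 53 - k); /* keep k of double's 53 bits */
    double d = (double)x * (Scale + 1.0);
    double t = d - (double)x;
    double hi = d - t;                 /* x rounded to k significant bits */
    *x0 = (float)hi;                   /* exact conversion when k <= 24 */
    *x1 = (float)((double)x - hi);     /* remaining low bits */
}

A side benefit of doing the split in double: `x * (Scale + 1)` cannot overflow for any finite float input, since 2^128 * 2^52 is still well inside double's range.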
Eric Postpischil
  • Why would this method be used instead of masking out bits in the float's mantissa, which I assume would be faster? – ideasman42 Feb 09 '18 at 05:00
  • 1
    @ideasman42: Masking bits is slower on some processors, as it may require moving a floating-point value to integer registers and back. And it does not handle subnormal values correctly. The above uses only three floating-point operations to produce the high bits (assuming `Scale+1` is known at compile time). – Eric Postpischil Feb 09 '18 at 05:06
  • First used by T. J. Dekker, "A floating-point technique for extending the available precision", Report MR 118/70, Computation Department, Mathematical Centre, Amsterdam. ([online](https://ir.cwi.nl/pub/9159)). Dekker credits personal communication with G. W. Veltkamp in 1968. The report was reworked into the paper T. J. Dekker, "A floating-point technique for extending the available precision". *Numerische Mathematik*, Vol. 18, No. 3, June 1971, pp. 224-242. – njuffa Feb 09 '18 at 08:23
  • For mathematical analysis, see this presentation: Sylvie Boldo, "Veltkamp & Dekker revisited", TYPES Workshop on Numbers and Proofs, Orsay (France), June 12-13, 2006 ([online](ftp://ftp-sop.inria.fr/marelle/Laurent.Thery/microsoft/Sylvie.Boldo.pdf)) – njuffa Feb 09 '18 at 08:26
  • 1
    "This assumes that `x * (Scale + 1)` does not overflow" is important. If you are handling numbers at the edges of the representable range, you must add some branching logic to handle them, as I realized [here](https://stackoverflow.com/questions/14285492/efficient-way-to-round-double-precision-numbers-to-a-lower-precision-given-in-nu/14285800#comment84454596_14285800). – Kevin Jin Feb 11 '18 at 14:37
  • @ideasman42: Additionally, masking bits truncates, whereas the Veltkamp-Dekker split rounds. – Eric Postpischil Dec 09 '20 at 13:42