Since float precision decreases as the magnitude grows, in some cases it may be useful to quantize a value relative to its size, instead of quantizing by an absolute step.
A naive approach could be to detect the precision (one ULP at the value's magnitude) and scale it up:
```c
float quantize(float value, float quantize_scale) {
    /* nextafterf takes a direction argument; stepping toward INFINITY
     * gives one ULP at this magnitude */
    float ulp = nextafterf(fabsf(value), INFINITY) - fabsf(value);
    float factor = ulp * quantize_scale;
    return floorf((value / factor) + 0.5f) * factor;
}
```
However, this seems too heavy.
Instead, it should be possible to mask out low bits of the float's mantissa to simulate something like casting to a 16-bit float and back, for example.
Not being an expert in float bit twiddling, I can't say whether the resulting float would be valid (or would need normalizing).
For speed, when exact rounding behavior isn't important, what is a fast way to quantize floats that takes their magnitude into account?