How to correctly represent a large numerical value in float without losing precision

Question

I have some values, read as strings from a file, that are greater than 2^23. When converting them to float, the values change due to precision loss.

double d = 50000167;
float f = (float)d;

f gives a value of 50000168.0.

I cannot use double due to design constraints. How can I get the job done using floats?

You can't. That's how floating point numbers work. `float` is simply not the right tool for the job here. If that precision isn't good enough for you, use `double` instead, and if that's not precise enough either, you should look into arbitrary precision / bignum libraries. — Blaze, Nov 08 '19 at 12:32
***How to get the job done using floats*** You can't because of this: [https://en.wikipedia.org/wiki/Single-precision_floating-point_format](https://en.wikipedia.org/wiki/Single-precision_floating-point_format) specifically ***Thus only 23 fraction bits of the significand appear in the memory format, but the total precision is 24 bits (equivalent to log10(2^24) ≈ 7.225 decimal digits).*** — drescherjm, Nov 08 '19 at 12:34
If all numbers are big but not that big, you can keep the common large factor or offset separately. — Evg, Nov 08 '19 at 12:59
Store `d / factor` or `d - offset` for some const `factor` or `offset`, not `d` itself. — Evg, Nov 08 '19 at 13:22
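
A minimal sketch of the offset idea from the comments above, assuming every input is an integer within 2^24 of some known constant. The names `kOffset`, `encode_with_offset` and `decode_with_offset` are made up for illustration, and the decode step still goes through `double`, which may or may not be acceptable under the asker's constraints.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical constant: a value known to be close to every input.
constexpr double kOffset = 50000000.0;

// Store only the distance from kOffset. A float represents every integer
// with magnitude up to 2^24 (16777216) exactly, so small integer
// distances survive the conversion unchanged.
float encode_with_offset(double d) {
    const double delta = d - kOffset;
    assert(std::fabs(delta) <= 16777216.0);  // assumption: inputs stay near kOffset
    return static_cast<float>(delta);
}

// Rebuild the original value in double arithmetic.
double decode_with_offset(float stored) {
    return kOffset + static_cast<double>(stored);
}
```

With these definitions, `encode_with_offset(50000167.0)` stores 167, and decoding it gives back 50000167 exactly. Non-integer distances can still be rounded.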

score 0 · Answer 1 · answered Nov 08 '19 at 12:53

In practice, you can't get the job done using floats. Floating point is the wrong tool to use when you need that much precision.

You probably should either use arbitrary precision arithmetic, or reduce your expectations about precision.
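
To see the limit concretely, here is a small sketch (the helper names are hypothetical, and it assumes the usual IEEE 754 `float` with round-to-nearest conversions). It measures the gap between neighbouring floats and checks whether a value survives a trip through `float`.

```cpp
#include <cmath>
#include <limits>

// Gap between x and the next representable float above it.
float float_spacing_at(float x) {
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}

// True if converting d to float and back yields exactly the same value.
bool survives_float_round_trip(double d) {
    return static_cast<double>(static_cast<float>(d)) == d;
}
```

Around 50000168 the gap is 4, so 50000167 has no exact `float` representation and the conversion lands on the nearest neighbour, 50000168.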

eerorika

score 0 · Accepted Answer · answered Nov 08 '19 at 13:33

You might be able to get away with doing something Hacky As Fuck: simulate a 40-bit float, which you would reconstitute into a double when you needed to use it, then smash back into a float with an extra byte of precision stored alongside it.

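Here's a double's layout: 1 sign bit, 11 exponent bits, 52 fraction bits (64 bits total).

Compared to a float's layout: 1 sign bit, 8 exponent bits, 23 fraction bits (32 bits total).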

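So basically adding one byte will take you from 23 fraction bits of precision to 31 (24 to 32 significant bits, counting the implicit leading 1).

You'd need to take your original value as a double, keep its top 23 fraction bits in the float, and write the next 8 fraction bits (bits 24-31, counting from the most significant fraction bit) to your extra byte, which you'd save alongside your float. The float has to be produced by truncating rather than by the usual round-to-nearest cast, otherwise its low bits won't line up with the extra byte. To switch back, you'd cast the float back to a double, then set that double's fraction bits 24-31 from your extra byte of storage.

By the same arithmetic as in @drescherjm's comment, 32 significant bits give log10(2^32) ≈ 9.6 decimal digits of precision, when this particular case only needs 8.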
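
The split-and-rejoin described above can be sketched roughly as follows. This is only an illustration under stated assumptions: IEEE 754 `float` and `double`, a finite input whose exponent fits in a normal `float`, and made-up names (`SplitValue`, `split`, `join`). The float is built from a truncated copy of the double so that its 23 fraction bits match the double's top 23 exactly.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(std::numeric_limits<double>::is_iec559 &&
              std::numeric_limits<float>::is_iec559,
              "sketch assumes IEEE 754 float and double");
static_assert(sizeof(double) == sizeof(std::uint64_t), "sketch assumes 64-bit double");

struct SplitValue {               // hypothetical name
    float most_of_the_value;      // sign, exponent, top 23 fraction bits
    std::uint8_t extra_bits;      // the next 8 fraction bits
};

static std::uint64_t double_bits(double d) {
    std::uint64_t u;
    std::memcpy(&u, &d, sizeof u);  // well-defined way to read the bit pattern
    return u;
}

static double bits_to_double(std::uint64_t u) {
    double d;
    std::memcpy(&d, &u, sizeof d);
    return d;
}

SplitValue split(double d) {
    const std::uint64_t bits = double_bits(d);
    // A double has 52 fraction bits and a float keeps the top 23, so clear
    // the low 29 before converting. The cast is then exact (no rounding).
    const std::uint64_t low29 = (std::uint64_t{1} << 29) - 1;
    SplitValue s;
    s.most_of_the_value = static_cast<float>(bits_to_double(bits & ~low29));
    // Fraction bits 24-31 from the top sit at bit positions 21..28.
    s.extra_bits = static_cast<std::uint8_t>((bits >> 21) & 0xFF);
    return s;
}

double join(const SplitValue& s) {
    // float -> double is exact, so the low 29 fraction bits start out zero.
    std::uint64_t bits = double_bits(static_cast<double>(s.most_of_the_value));
    bits |= static_cast<std::uint64_t>(s.extra_bits) << 21;
    return bits_to_double(bits);
}
```

For 50000167 the float part holds 50000164 and the extra byte carries the rest, so `join(split(50000167.0))` returns 50000167. Anything below the 31st fraction bit is simply dropped.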

There's a big fat ugly caveat to all this. If you declare a struct:

struct ugly40BitFloatHackityHackHack {
    float mostOfTheValue;                // sign, exponent, top 23 fraction bits
    unsigned char extra8BitsOfPrecision; // BYTE on Windows is a typedef for unsigned char
};

... The compiler will often reserve 64 bits (the size of a double) for it due to "Memory Alignment": the struct is padded so that its size is a multiple of the float's alignment, typically 4 bytes. Basically, the computer works faster when variables are properly lined up in memory so they can directly map to CPU registers... on 32/64/whatever bit boundaries. There are ways to control memory alignment (or "pack" everything in as tightly as possible), but they are compiler-specific.
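
If you want to check what your own compiler does, a sketch like this reports the padding. The struct name `FloatPlusByte` and the helper `padding_bytes` are made up here, mirroring the struct above, and the actual numbers are implementation-defined.

```cpp
#include <cstddef>
#include <cstdint>

struct FloatPlusByte {            // hypothetical name, same shape as the struct above
    float most_of_the_value;
    std::uint8_t extra_bits;
};

// Bytes of padding the compiler added beyond the two members.
constexpr std::size_t padding_bytes() {
    return sizeof(FloatPlusByte) - (sizeof(float) + sizeof(std::uint8_t));
}

// These hold on any conforming compiler; the exact sizes do not.
static_assert(sizeof(FloatPlusByte) % alignof(FloatPlusByte) == 0,
              "a struct's size is a multiple of its alignment");
static_assert(alignof(FloatPlusByte) >= alignof(float),
              "a struct is at least as strictly aligned as its members");
```

On a typical desktop compiler `sizeof(FloatPlusByte)` comes out as 8 with 3 bytes of padding, which is where the "might as well have used a double" observation comes from.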

Ah. In rereading your question just now, you're not concerned about 'storage', but some arbitrary design constraint. I'm guessing this is either for a class, or your boss is a fossil harkening back to the days when the most expensive thing a CPU could do was floating point math. These days, CPUs spend way too much time waiting for memory. "Back In The Day" a floating point operation could take 20-40 cycles, and doubles were even more expensive than floats. These days, floating point operations are cheap (at or around 1 cycle), and accessing memory is expensive (a cache miss can cost around 100-200 cycles... linked lists can make your CPU weep tears of blood).

If your boss is worried about efficiency, use a profiler. Don't guess. Don't apply decades-out-of-date "wisdom". Use a profiler.

And if this really is just a class assignment, this may just be a way to teach you about precision limits in float vs double. Congratulate your teacher: It worked.
