Mathematical operation between integer and float in a 32-bit system

Question

On a 32-bit system, I found that the operation below always returns the correct value when a < 2^31 but returns seemingly random results when a is larger.

uint64_t a = 14227959735;
uint64_t b = 32768;
float c = 256.0;
uint64_t d = a - b / c; // d ends up as 14227959808 (the exact answer would be 14227959607)

I believe the problem here is that the int-to-float conversion invokes undefined behavior, but could someone help explain why it gives this particular value?

floating point calc only accurate to 7 digits http://stackoverflow.com/questions/9765744/precision-in-c-floats — pm100, Dec 21 '16 at 01:38
There's nothing random or undefined, and the result you're getting is very nearly correct (off by one part in 100 million). It's just a rounding error. If you used `double` rather than `float` you'd get the exact result, though you could still get rounding errors for larger input values. — Keith Thompson, Dec 21 '16 at 01:45
There are a large number of questions of which this could be a duplicate; the one I chose has the merit of being one of the oldest on SO (and it has excellent answers). Incidentally, the result would be the same on a 64-bit system; the problem is that `float` (almost always) uses 4 bytes which doesn't give very many decimal digits of precision. — Jonathan Leffler, Dec 21 '16 at 01:48

score 2 · Accepted Answer · answered Dec 21 '16 at 01:29

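The entire calculation is carried out in float, and the result is then converted to a 64-bit integer. But a float (assuming the usual IEEE 754 single precision) has only a 24-bit significand, so it can't exactly represent integers above 2^24 unless they fit in 24 significant bits, as powers of two do. Near 14227959735, which lies between 2^33 and 2^34, adjacent floats are 1024 apart, so a is rounded to 14227959808 and subtracting 128 doesn't change it.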
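
To make the rounding visible, here is a minimal sketch that does the same subtraction once in float and once in double. The function names are made up for illustration, and it assumes `float` is IEEE 754 single precision and that float expressions are really evaluated in float precision (not in x87 extended precision), so treat it as a demonstration rather than a guaranteed result on every compiler.

```c
#include <stdint.h>

/* Hypothetical helper names, not part of the original question. */

/* Mirrors the expression in the question: both integer operands are
 * converted to float, so a is rounded to 24 significant bits. */
uint64_t subtract_via_float(uint64_t a, uint64_t b, float c)
{
    float quotient = (float)b / c;     /* 32768 / 256.0f is exactly 128.0f */
    float diff = (float)a - quotient;  /* rounding of a happens here */
    return (uint64_t)diff;             /* assumes diff is not negative */
}

/* Same calculation in double, whose 53-bit significand holds every
 * integer up to 2^53 exactly. */
uint64_t subtract_via_double(uint64_t a, uint64_t b, float c)
{
    double quotient = (double)b / (double)c;
    return (uint64_t)((double)a - quotient);  /* assumes result is not negative */
}
```

With the values from the question, `subtract_via_float(14227959735, 32768, 256.0f)` should give 14227959808, while `subtract_via_double` with the same arguments should give 14227959607. Double only moves the limit to 2^53, so if c is always a whole number, plain integer arithmetic avoids the problem entirely.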

Malcolm McLean
