How to keep precision on int64_t = int64_t * float?

Question

I would like to perform a correction on an int64_t by a factor in the range [0.01..1.2] with precision is about 0.01. The naive implementation would be:

int64_t apply_correction(int64_t y, float32_t factor)
{
    return y * factor;
}

Unfortunately, I will loose precision either if I cast factor to int32 or if I cast y into float.

However, if I can ensure y has its maximum value below 1<<56, I can use this trick:

(1<<8) * (y / (int32_t)(factor * (1<<8)))

How can I solve this problem if my input value can be bigger than 1<<56?

Plot twist:

I am running on a 32-bit architecture where int64_t is an emulated type and where I don't have any support for double precision. The architecture is SHARC from Analog Devices.

@luk32 it will not work as `factor` is in the range 0.01-1.2. — fluter, Apr 26 '16 at 09:07
if you don't have support for double precision you can write a library for that, or just use tons of double/multiple precision libraries out there. But using double still doesn't help you because it has only 53 bits of precision, so you can't get full 64-bit precision — phuclv, Apr 26 '16 at 09:52
@fluter Oh, I misread `int64` for `float64` and completetly didn't understand what what the problem was. I even absent-mindedly retyped `int` thinking abotu float. I guess it was to early for even basic brain activity for me. — luk32, Apr 26 '16 at 11:26

John Zwinck · Answer 1 · 2016-04-26T09:55:26.937

3

How about doing it in integer space?

/* factor precision is two decimal places */
int64_t apply_correction(int64_t y, float32_t factor)
{
    return y * (int32_t)(factor * 100) / 100;
}

This does assume y is not very close to the maximum value, but it gets you a little closer than 56 bits.

edited Apr 26 '16 at 09:55

answered Apr 26 '16 at 09:06

John Zwinck

239,568
38
324
436

This `int32_t(factor * 100)` won't compile in C. – alk Apr 26 '16 at 09:08
1

@AnttiHaapala: Integer overflow is possible with any solution here, since an int64_t is multiplied by up to 1.2 then returned as the same type. I've updated my answer to mention the specific caveats. – John Zwinck Apr 26 '16 at 09:08
@AnttiHaapala: "*factor in the range [0.01..1.2] ... y has its maximum value below 1<<56*" which overlfow? – alk Apr 26 '16 at 09:10
@alk "How can I solve this problem if my input value can be bigger than 1<<56?" – Weather Vane Apr 26 '16 at 09:11
Aarg ... I parsed this 1<<56 completely wrong! :} @AnttiHaapala – alk Apr 26 '16 at 09:36
1

@JohnZwinck this gets you about 1 bit **closer** to maximum value. – Antti Haapala -- Слава Україні Apr 26 '16 at 09:49

Antti Haapala -- Слава Україні · Accepted Answer · 2016-04-26T09:53:27.693

If you calculate ((int64_t)1 << 57) * 100 or * 256, you will have a signed integer overflow, which would lead to your code having undefined behaviour. If instead you used uint64_t and the value, then your code would be well-defined but definedly ill-behaved.

However it is possible to make this work for numbers almost up to (1 << 63 / 1.2).

If y were an uint64_t you can split the original number to most-significant 32 bits shifted right by 32, and the least-significant 32 bits, multiply this by (int32_t)(factor * (1 << 8)).

Then you do not right-shift the most-significant bits by 8 after the multiplication, but left-shift by 24; then add together:

uint64_t apply_uint64_correction(uint64_t y, float32_t factor)
{
    uint64_t most_significant = (y >> 32) * (uint32_t)(factor * (1 << 8));
    uint64_t least_significant = (y & 0xFFFFFFFFULL) * (uint32_t)(factor * (1 << 8));     
    return (most_significant << 24) + (least_significant >> 8);
}

Now, apply_uint64_correction(1000000000000, 1.2) would result in 1199218750000, and apply_uint64_correction(1000000000000, 1.25) would result in 1250000000000.

Actually you can make more precision out of it if you can guarantee the range of factor:

uint64_t apply_uint64_correction(uint64_t y, float32_t factor)
{
    uint64_t most_significant = (y >> 32) * (uint32_t)(factor * (1 << 24));
    uint64_t least_significant = (y & 0xFFFFFFFFULL) * (uint32_t)(factor * (1 << 24));     
    return (most_significant << 8) + (least_significant >> 24);
}

apply_uint64_correction(1000000000000, 1.2) would give 1200000047683 on my computer; this is also the maximum precision you can get out of it, if float32_t has 24-bit mantissa.

The above algorithm would work for signed positive numbers as well, but as signed shifts for negative numbers are a grey area I'd take note of the sign, then convert the value to uint64_t, do the calculations portably, and then negate if original sign was negative.

int64_t apply_correction(int64_t y, float32_t factor) {
    int negative_result = 0;
    uint64_t positive_y = y;
    if (y < 0) {
        negative_result = 1;
        positive_y = -y;
    }

    uint64_t result = apply_uint64_correction(positive_y, factor);
    return negative_result ? -(int64_t)result : result;
}

What happens to the remainder/fraction from the most-significant part? — Weather Vane, Apr 26 '16 at 09:19

Lundin · Answer 3 · 2016-04-26T09:36:28.973

2

Just don't use float numbers.

int64_t apply_correction(int64_t y, float32_t factor)
{
  int64_t factor_i64 = factor * 100f;

  return (y * factor_i64) / 100ll;
}

This is assuming that y * factor_i64 * 100 will not overflow.

edited Apr 26 '16 at 09:36

answered Apr 26 '16 at 09:10

Lundin

195,001
40
254
396

Doesn't this assume `long long` is available on the system? – Qix - MONICA WAS MISTREATED Apr 26 '16 at 09:11
@Qix Yes, indeed it is assuming that standard C is used. You could also write `(int64_t)0`. At any rate it doesn't really matter because of implicit type promotion through balancing. That is, `(y * factor) / 100;` is equivalent and safe. I just like to be explicit with types. – Lundin Apr 26 '16 at 09:12
@cmaster Oops indeed, fixed. – Lundin Apr 26 '16 at 09:36
standard **C99** C, sure. `long long` wasn't available in c89/c90/c95, which should be taken into consideration given that this is an embedded system. – Qix - MONICA WAS MISTREATED Apr 26 '16 at 21:19
@Qix If you are maintaining old stuff, sure. But if your project is a new one, there are no excuses for using C90 still. Note that the types `int64_t` and `float32_t` were introduced in C99. Also per SO C tag policy: unless the OP specifically states that they are interested in an older version of the language, always assume they are using the current version of the standard, which is C11. – Lundin Apr 27 '16 at 06:31
Fair enough. It's always a concern, though, since a few embedded systems' compilers are still pinned to C89 due to the company going out of business, etc. – Qix - MONICA WAS MISTREATED Apr 27 '16 at 06:46

How to keep precision on int64_t = int64_t * float?

3 Answers3