Find float a to closest multiple of float b

Question

C++ Scenario: I have two variables of type double a and b.

Goal: a should be set to the closest multiple of b that is smaller than a.

First approach: Use fmod() or remainder() to get r. Then do a = a - r. I know that due to the representation of decimal numbers in memory fmod() or remainder() can never guarantee 100% accuracy. In my tests I found that I cannot use fmod() at all, as the variance of its results is too unpredictable (at least as far as I understand). There are many questions and discussions out there talking about this phenomenon. So is there something I could do to still use fmod()? With “something” I mean some trick similar to checking if a equals b by employing a value double

EPSILON = 0.005;
if (std::abs(a-b) < EPSILON)
   std::cout << "equal" << '\n';

My second approach works but seems not to be very elegant. I am just subtracting b from a until there is nothing left to subtract:

double findRemainder(double x, double y) {
    double rest;
    if (y > x)
    {
        double temp = x;
        x = y;
        y = temp;
    }

    while (x > y)
    {
        rest = x - y;
        x = x - y;
    }
    return rest;
}

int main()
{ 
   typedef std::numeric_limits<double> dbl;
   std::cout.precision(dbl::max_digits10);
   double a = 13.78, b = 2.2, r = 0;
   r = findRemainder(a, b);
   return 0;
}

Any suggestions for me?

@VladimirNikitin If you are unlucky, `a/b` could round up to the next integer, making the result too large (currently too lazy to find an example, but basically you need some `b` and `a=something.5000001 * b`) — chtz, May 13 '20 at 08:57
Just btw: `fmod` is always exact (at least for finite input and `b!=0`): https://stackoverflow.com/questions/20928253/is-fmod-exact-when-y-is-an-integer. But `a - fmod(a,b)` might round incorrectly, especially if `a` is much larger than `b`. And if you can't guarantee that both inputs are positive, you need to handle the signs somehow. — chtz, May 13 '20 at 09:45
Note that floating point number "that is a multiple of" another number is not necessarily the same number that is the multiple in actual math. — eerorika, May 13 '20 at 09:54
@chtz thank you for your reply. So I believe we say it is exact because `double m=0.9, n=0.1` will result in `0.999...` because `0.9` is represented as `0.90000000000000002` and `0.1` as `0.10000000000000001` which means the result is correct given the underlying numbers. But then I do not understand how to use `fmod()` in my program. How can I know when `fmod(m,n)` is going to give me a useful result - and when it is going to give me something that would lead to a wrong calculation? Obviously I would expect a result close to zero. — MYZ, May 13 '20 at 11:18
For positive `a` and `b`, this should result in the closest value to the largest integer multiply of `b` smaller than `a`: `a -= fmod(a, b);` — chtz, May 13 '20 at 13:20
MertY, `fmod()` is not the problem. How code starting with 0.9 as a string or a code constant result in a `double` _near_ 0.9, is the issue. To resolve that, we need the higher level goal. — chux - Reinstate Monica, May 13 '20 at 17:35
MertY Please provide a case where `a -= fmod(a,b);` did not work for you. I'd expect that is better than `b * floor(a / b)`. In that failed case, what where the values of `a,b` and what result was seen and what result was expected? — chux - Reinstate Monica, May 13 '20 at 17:44
Thank you for your comments! The case `a=0.96`, `b=0.1` works as intended. I get (as a rounded result of `a -= fmod(a,b) = 0.9`. But the value pair of `a=0.9`, `b=0.1` results in `0.8`, whereas I would expect `0.9` as I would expect `fmod(a,b)` to return zero (or something close to it). I understood now why `fmod` behaves the way it does, I am just not sure how to handle this in a reliable way. This is a calculation based on user input parameters, so it needs to behave the same predictable way for all kind of values. — MYZ, May 14 '20 at 07:33
@chux-ReinstateMonica I should add: The numbers are not converted from any string format, they are input by users, checked for some bounds, and then passed to the function. — MYZ, May 14 '20 at 07:37
@MertY Try the [same numbers](https://stackoverflow.com/questions/61769989/find-float-a-to-closest-multiple-of-float-b?noredirect=1#comment109296685_61769989), but print with `printf("%.20e %.20e %.20e\n", a,b,fmod(a,b));` for greater insight. — chux - Reinstate Monica, May 14 '20 at 13:05
@MertY "I am just not sure how to handle this" --> `fmod()` gets the remainder, yet you want the "closest.". Try `a -= (fmod(a + b/2, b) + b/2);`. Does that meet your goal? — chux - Reinstate Monica, May 14 '20 at 13:11
@chux-ReinstateMonica unfortunately that gives me the same result `0.800...04`. — MYZ, May 15 '20 at 08:29
@chux-ReinstateMonica okay my first tests suggest this works :) After work I will sit down and try to understand why :-) thank you a lot for your multiple answers and suggestions! — MYZ, May 15 '20 at 10:40

Eric Postpischil · Answer 1 · 2020-07-27T21:26:47.127

Preamble

The problem is impossible, both as stated and as intended.

Remainders are exact

This statement is incorrect: “fmod() or remainder() can never guarantee 100% accuracy.” If the floating-point format supports subnormal numbers (as IEEE-754 does), then fmod(x, y) and remainder are both exact; they produce a result with no rounding error (barring bugs in their implementation). The remainder, as defined for either of them, is always less than y and not more than x in magnitude. Therefore, it is always in a portion of the floating-point format that is at least as fine as y and as x, so all the bits needed for the real-arithmetic remainder can be represented in the floating-point remainder. So a correct implementation will return the exact remainder.

Multiples may not be representable

For simplicity of illustration, I will use IEEE-754 binary32, the format commonly used for float. The issues are the same for other formats. In this format, all integers with magnitude up to 2²⁴, 16,777,216, are representable. After that, due to the scaling by the floating-point exponent, the representable values increase by two: 16,777,218, 16,777,220, and so on. At 2²⁵, 33,554,432, they increase by four: 33,554,436, 33,554,440. At 2²⁶, 67,108,864, they increase by eight.

100,000,000 is representable, and so are 99,999,992 and 100,000,008. Now consider asking what multiple of 3 is the closest to 100,000,000. It is 99,999,999. But 99,999,999 is not representable in the binary32 format.

Thus, it is not always possible for a function to take two representable values, a and b, and return the greatest multiple of b that is less than a, using the same floating-point format. This is not because of any difficulty computing the multiple but simply because it is impossible to represent the true multiple in the floating-point format.

In fact, given the standard library, it is easy to compute the remainder; std::fmod(100000000.f, 3.f) is 1. But it is impossible to compute 100000000.f − 1 in the binary32 format.

The intended question is impossible

The examples shown, 13.78 for a and 2.2 for b, suggest the desire is to produce a multiple for some floating-point numbers a and b that are the results of converting decimal numerals a and b to the floating-point format. However, once such conversions are performed, the original numbers cannot be known from the results a and b.

To see this, consider values for a of either 99,999,997 or 100,000,002 while b is 10. The greatest multiple of 10 less than 99,999,997 is 99,999,990, and the greatest multiple of 10 less than 100,000,002 is 100,000,000.

When either 99,999,997 or 100,000,002 is converted to the binary32 format (using the common method, round-to-nearest-ties-to-even), the result for a is 100,000,000. Converting b of course yields 10 for b.

Then a function that converts the greatest multiple of a that is less than b can return only one result. Even if this function uses extended precision (say binary64) so that it can return either 99,999,990 or 100,000,000 even though those are not representable in binary32, it has no way to distinguish them. Whether the original a is 99,999,997 or 100,000,002, the a given to the function is 100,000,000, so there is no way for it to know the original a and no way for it to decide which result to return.

Absolutely related to this question, the implementation of truncateTo: in Pharo Smalltalk. See https://github.com/pharo-project/pharo/issues/4728#issuecomment-537164376 and why I think it is a false problem we shall even not try to solve. — aka.nice, May 14 '20 at 09:36

aka.nice · Answer 2 · 2020-05-13T19:29:27.277

Hmm, there really is a problem of definition, because most multiples of a floating point won't be representable exactly, except maybe if the multiplier is a power of two.

Taking your example and Smalltalk notations (which does not really matter, I do it just because i can evaluate and verify the expressions I propose), the exact fractional representation of double precision 0.1 and 0.9 can be written:

(1+(1<<54)reciprocal) / 10 = 0.1.
(9+(1<<52)reciprocal) / 10 = 0.9.

<< is a bistshift, 1<<54 is 2 raised to the power of 54, and reciprocal is its inverse 2^-54.

As you can easily see:

(1+(1<<54)reciprocal) * 9 > (9+(1<<52)reciprocal)

That is, the exact multiple of 0.1 is greater than 0.9.
Thus, technically, the answer is 8*0.1 (which is exact in this lucky case)

(8+(1<<51)reciprocal) / 10 = 0.8.

What remainder does is to give the EXACT remainder of the division, so it is related to above computations somehow.

You can try it, you will find something like-2.77555...e-17, or exactly (1<<55) reciprocal. The negative part is indicating that nearest multiple is close to 0.9, but a bit below 0.9.

However, if your problem is to find the greatest <= 0.9, among the rounded to nearest multiple of 0.1, then your answer will be 0.9, because the rounded product is 0.1*9 = 0.9.

You have to first resolve that ambiguity. If ever, you are not interested in multiples of 0.1, but in multiples of (1/10), then it's again a different matter...

Find float a to closest multiple of float b

2 Answers2

Preamble

Remainders are exact

Multiples may not be representable

The intended question is impossible

Linked