Casting from double to short decreasing the result in C

Question

I have a chunk of code that calculates something. The result of the calculation is a double, but when I cast it and assign it to a short, the value decreases. For example, if the result as a double is 30.000000, it becomes 29 after casting to short.

short calcDays(int a, double b, double c)
{
    double result = (double) (a* (b/c));   // gives 30.000000
    short days = (short) result; // gives 29
    return days;
}

I have tried casting it to integer too. Same result.

EDIT: a is always a multiple of 1 (min value 1, max 365). b and c are always multiples of .1 (min value 1.0, max 1000). a, b and c come from the UI as service call params.

Since `b` and `c` are both `double` the result of `b / c` will also be a `double`, and therefore will the result of the whole expression `a * (b / c)` be a `double`. The cast is not needed. — Some programmer dude, Jul 20 '21 at 08:09
Floating point is not exact. A number that prints as `30.0000` might actually be `30.0000000001` or `29.99999999999`. In the latter case, when you convert it to an integer you get `29`. You should round it instead of casting to short. — Barmar, Jul 20 '21 at 08:10
Floating-point arithmetic approximates real arithmetic. Arithmetic operations introduce rounding errors. Sometimes the accuracy can be improved. Try `a*b/c`. Show sample values for `a`, `b`, and `c` and describe what you are trying to do. When the result is not an integer, do you want a truncated result or a rounded result? Why? What values could `a`, `b`, and `c` have? — Eric Postpischil, Jul 20 '21 at 08:13
@Someprogrammerdude the cast is not needed, but it may prevent a warning such as `'initializing': conversion from 'double' to 'short', possible loss of data` with the MS compiler — Jabberwocky, Jul 20 '21 at 08:15
@Jabberwocky I meant in `double result = (double) (a* (b/c));` Perhaps I should have phrased it differently to make it clearer? — Some programmer dude, Jul 20 '21 at 08:17
I have the strong impression that you could do this without floating point arithmetic at all. You should tell us what this function is supposed to do and provide some real world sample values for a, b and c. — Jabberwocky, Jul 20 '21 at 08:19
@Jabberwocky This calculation is for finding some days by using some medicine quantity. A is always integer, while B and C has decimal values sometimes (eg - 10.200000, 13.500000, 30.00000). In my case, I'm getting both B and C as 10.20000 and A as 30. The expression should give 30 as a result. Instead it's giving 29. — Tamoghna Purkait, Jul 20 '21 at 08:38
Also the function has to return Short type as per business requirements. When the calculated value comes as 10.5 or 10.2 or 10.9, we need to round it down to 10 always. @Eric Postpischil — Tamoghna Purkait, Jul 20 '21 at 08:40
@TamoghnaPurkait the second comment should do the job: `short days = (short)result` -> `short days = (short)round(result)` — Jabberwocky, Jul 20 '21 at 08:43
@TamoghnaPurkait yes, some rounding is definitely going to happen. Show some examples of `b/c` and the result you'd like in `days`. — Jabberwocky, Jul 20 '21 at 08:56
It is impossible for `b` to be 10.2 in your C implementation. It uses a binary format for `double`, and the closest representable value is 10.199999999999999289457264239899814128875732421875. This means that rounding errors have occurred even before `calcDays` is called. For a solution, you must provide more information: What values could `a`, `b`, and `c` ideally have? For example, are `b` and `c` always a multiple of .1? Of .01? Of some other number? — Eric Postpischil, Jul 20 '21 at 09:04
@Eric Postpischil From UI user inputs the values of a,b and c.. b and c can be anything like 1, 2, 10, 20, 100, 300, 124 also can be 10.5, 23.2, 61.2 while a will always be whole number. — Tamoghna Purkait, Jul 20 '21 at 09:12
Please give complete specifications, not just examples. What is the maximum value `a` can have? What is the maximum value `b` can have? What is the maximum value `c` can have? Is the ideal `b` always a multiple of .1? Is the ideal `c` always a multiple of 1? Is the ideal `c` always a multiple of .1? Are `b` and `c` derived directly from user input, as if by `scanf`, or are they computed with other calculations? If the latter, we need to figure out how much error they can have. — Eric Postpischil, Jul 20 '21 at 09:22
@Eric Postpischil a is always multiple of 1, min value 1 max is 365 b and c is always multiple of .1, min value 1.0 max is 1000 a,b and c are coming from UI as service call param — Tamoghna Purkait, Jul 20 '21 at 09:51
Given those constraints, the easiest way to compute the answer may be `return (short) (a * lround(10*b) / lround(10*c));`. Does that do what you want? (Note that, given the constraints, some correct results might not fit in `short`, but that is another matter.) — Eric Postpischil, Jul 20 '21 at 23:52

Eric Postpischil · Accepted Answer · 2021-07-22T02:13:36.737

Given the constraints on `a`, `b`, and `c`, we can compute the desired result as `(short) (a * lround(10*b) / lround(10*c))` or as `(short) (a * b / c + .00005)`. (This of course requires that the result be representable in `short`. That is not guaranteed by the stated limits on `a`, `b`, and `c`.)

In the former, ab/c is equivalent to a•10b/(10c), so we just need to show that this is what the expression computes, including that the arithmetic does not suffer from rounding errors. We know b would ideally be a multiple of .1, so 10b would ideally be an integer. `lround(10*b)` finds this integer, effectively correcting for any error that occurred in converting a decimal numeral to the `double` format. Similarly, `lround(10*c)` finds the ideal value of 10c. `lround` returns a value of type `long`, so the multiplication and the division are performed with integer arithmetic. The `long` type is also capable of representing the necessary range: `a * lround(10*b)` is at most 3,650,000, and `long` can represent values up to at least 2,147,483,647. So the multiplication is exact, and the division truncates the way we desire.

A proof for the latter follows.

The following assumes the IEEE-754 “double” format is used for `double`. It is sufficient that, after `<float.h>` is included, `#if DBL_MANT_DIG >= 53` is true.
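One way to enforce that assumption is a compile-time guard. This is only a sketch: the error message and the helper `double_is_wide_enough` are made up for illustration.

```c
#include <float.h>

/* Refuse to compile if double is narrower than the analysis assumes. */
#if DBL_MANT_DIG < 53
#error "double has fewer than 53 significand digits; the error analysis does not apply"
#endif

/* Hypothetical run-time view of the same property. */
int double_is_wide_enough(void)
{
    return DBL_MANT_DIG >= 53;
}
```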

Conversion of a numeral in a user-provided string to double ought to yield a number with error of at most one unit of least precision (ULP). This is suggested in various places in the C standard (and it is unclear whether certain parts of the text intend to require this). However, let’s assume we have a bad conversion with error up to 1024 ULP.

`b` and `c` can be up to 1000, for which the ULP is 2⁻⁴³, so 1024 ULP is 2⁻³³. Thus, if the user enters the decimal value b, the `double` stored in `b` is b(1+e_b), where |e_b| ≤ 2⁻³³, and the `double` stored in `c` is similarly c(1+e_c). As an integer up to 365, `a` is of course exactly the a entered by the user.

When `a * b` is computed, the result is ab(1+e_b)(1+e₀), where e₀ is the error introduced by the multiplication. In any rounding mode, |e₀| is less than 1 ULP. The maximum value at this point is 365,000, for which 1 ULP is 2⁻³⁴.

Then we divide by `c`, for which the minimum value is 1. The result is ab(1+e_b)(1+e₀)/(c(1+e_c))•(1+e₁), where e₁ is the error introduced by the division. Again the maximum value is 365,000, so |e₁| < 2⁻³⁴.

Rearranging, the result is ab/c • (1+e_b)(1+e₀)(1+e₁)/(1+e_c). We can easily see a bound for the error is given when e_b is +2⁻³³, e₀ and e₁ are 2⁻³⁴, and e_c is −2⁻³³. Some work with a calculator shows us the error is less than 3.5•10⁻¹⁰.
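The calculator step can be reproduced in C. The function name is hypothetical. The `double` arithmetic used here adds its own rounding error, on the order of 10⁻¹⁶, which is far too small to affect the comparison against 3.5•10⁻¹⁰.

```c
#include <math.h>

/* Evaluates (1+e_b)(1+e0)(1+e1)/(1+e_c) - 1 at the worst-case values
   named in the text. Hypothetical helper for illustration only. */
double worst_case_relative_error(void)
{
    double eb = ldexp(1.0, -33);   /* e_b = +2^-33 */
    double e0 = ldexp(1.0, -34);   /* e0  = +2^-34 */
    double e1 = ldexp(1.0, -34);   /* e1  = +2^-34 */
    double ec = -ldexp(1.0, -33);  /* e_c = -2^-33 */

    return (1 + eb) * (1 + e0) * (1 + e1) / (1 + ec) - 1;
}
```

The leading term is 3•2⁻³³, which is about 3.4925•10⁻¹⁰, so the result sits just under the 3.5•10⁻¹⁰ bound.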

Note that ab/c equals 10ab/(10c) and consider the latter. The numerator is an integer, and the denominator is an integer that is at most 10,000. Therefore, the closest the quotient can be to an integer without being an integer is .0001. And the closest the result computed with rounding errors can be to an integer is less than .0001 − 3.5•10⁻¹⁰. Therefore, if we add .00005 to the computed result, it will push every result that would have been an integer (without rounding errors) above that integer, but it will not push any result that would not have been an integer above the next integer. Therefore, the integer below `a * b / c + .00005` is the desired result, and casting to `short` provides that integer, if it is in the range of `short`.
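A minimal sketch of the second expression follows. The name `calc_days_biased` and the −1 sentinel for an unrepresentable result are assumptions made for illustration, and the inputs are assumed to obey the constraints stated in the question. The range check is there because converting an out-of-range `double` to `short` has undefined behavior.

```c
#include <limits.h>

/* Hypothetical helper, assuming 1 <= a <= 365 and that b and c are
   multiples of .1 between 1.0 and 1000. Returns -1 (a made-up
   convention) when the result does not fit in short. */
short calc_days_biased(int a, double b, double c)
{
    /* Same operation order as in the analysis: multiply, divide, then
       add the small bias that lifts near-integers over the boundary. */
    double q = a * b / c + .00005;

    if (q >= (double) SHRT_MAX + 1)
        return -1;          /* out of range for short */

    return (short) q;       /* truncates toward zero; q is non-negative */
}
```

With the question's values, `calc_days_biased(30, 10.2, 10.2)` returns 30, and `calc_days_biased(1, 10.5, 1.0)` returns 10.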
