
Simple question: Will the conversion from an int, say 100, to a double "round" up or down to the next double or will it always round to the nearest one (smallest delta)?

e.g. for static_cast<double>(100):

[Figure: Casting from int to double]

Which way will it cast if d2 < d1?

Bonus question: Can I somehow force the processor to "round" down or up to the closest double using special functions? As far as I can see, there's unfortunately no `floor<double>` or `ceil<double>`.

glades
  • 6
    assuming `int` is 32-bit it will always convert to an exact `double` value (again assuming a 64-bit IEEE-754 double). See [Real floating-integer conversions](https://en.cppreference.com/w/c/language/conversion) for the rules in other cases – Alan Birtles Mar 08 '23 at 07:53
  • 6
    For values smaller than ~2^53 (9007199254740992) there's no issue - they can be exactly represented in `double` (see here: https://stackoverflow.com/questions/1848700/biggest-integer-that-can-be-stored-in-a-double). The question remains, however, for values bigger than that (you'll need a 64-bit integer). – wohlstad Mar 08 '23 at 08:03
  • @AlanBirtles That is clear to me. However what I don't know is if this exact double value will be the one with the smallest delta always or if it is rounded up or down. – glades Mar 08 '23 at 08:03
  • @glades what do you mean by _"smallest delta always or if it is rounded up or down"_ ? It's the exact values as the integer, no delta (delta==0) and no need to round. – wohlstad Mar 08 '23 at 08:06
  • 1
    @wohlstad I think you mean by "exactly represented" that a round trip conversion will yield the exact same integer value. Only powers of 2 can be represented exactly in a floating point type. – glades Mar 08 '23 at 08:07
  • 4
    @glades why do you think only powers of 2 can be represented exactly? A `double` typically has 53 bits of mantissa. You can have an exact representation for at least around 2^53 values, not only powers of 2 (e.g. by setting the exponent to 2^0). – wohlstad Mar 08 '23 at 08:12
  • 7
    No, any integer up to 2^53 can be represented exactly, not just powers of 2; then it's every other integer up to 2^54, every fourth up to 2^55, etc. – Alan Birtles Mar 08 '23 at 08:12
  • @AlanBirtles Oh my bad sorry of course you can just shift the floating point. – glades Mar 08 '23 at 08:12
  • 1
    @glades That's why it's called "floating". – molbdnilo Mar 08 '23 at 09:15
  • There are `floor()` and `ceil()` functions in the C standard library, which are therefore usable in C++. Their arguments and return values are of type `double`. But **these do not help** because converting an integer that is not exactly representable as a `double` will result in a `double` with a finite decimal representation that has no non-zero fractional digits. That is, the result is already integral, and should not be altered by `floor()` or `ceil()`, even though converting back to an integer type will not reconstitute the original integer. – John Bollinger Mar 08 '23 at 23:09
  • @glades: It’s not about shifting the floating point. It’s about that more than a single bit in the mantissa can be one. – Thomas Padron-McCarthy Mar 09 '23 at 05:20

1 Answer


Note that a 32-bit int can be represented exactly by a 64-bit IEEE 754 double (it can actually represent up to 53-bit integers precisely).

If the integer cannot be represented exactly in your floating-point type, the rules at [Real floating-integer conversions](https://en.cppreference.com/w/c/language/conversion) apply:

  • if the value can be represented, but cannot be represented exactly, the result is the nearest higher or the nearest lower value (in other words, rounding direction is implementation-defined), although if IEEE arithmetic is supported, rounding is to nearest. It is unspecified whether FE_INEXACT is raised in this case.
  • if the value cannot be represented, the behavior is undefined, although if IEEE arithmetic is supported, FE_INVALID is raised and the result value is unspecified.

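C++ inherits `std::fesetround` from C via `<cfenv>`, but the C++ standard does not guarantee that the rounding mode affects integer-to-floating-point conversions, and compiler support for accessing the floating-point environment varies. By default, most implementations use IEEE round-to-nearest.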

Alan Birtles
  • The parenthetical note on the first bullet point does not mean the same as the preceding text. The preceding text would accommodate the possibility that an implementation might choose in Unspecified fashion between yielding the next higher or next lower value, which might allow some optimizations in cases where both values would work equally well. – supercat Mar 08 '23 at 19:39
  • @supercat not sure what you mean? The quote is directly from cppreference – Alan Birtles Mar 08 '23 at 19:54
  • 1
    @AlanBirtles: It's sloppy writing, wherever it came from. The phrase "in other words" is supposed to appear between pieces of text that mean the same thing, but in different words. If implementations are required to document in all cases how they will choose the rounding direction, the text should say something like "Additionally, the choice of whether to round up or down must be made in implementation-defined fashion", and if implementations are not required to document things that precisely, the choice is Unspecified and the statement that it's... – supercat Mar 08 '23 at 20:17
  • ... "implementation-defined" would be erroneous. – supercat Mar 08 '23 at 20:18
  • @supercat, the way I'm reading it is that if IEEE arithmetic is used, then rounding is to the nearest value using the IEEE rules. If IEEE arithmetic is *not* used, then the implementation can pick either the nearest higher value or the nearest lower value based on whatever rules seem appropriate. – Mark Mar 08 '23 at 23:17
  • @Mark: If the choice is "implementation defined", that would imply that the implementation should document how it would make the choice in each and every situation that should arise, even if e.g. it would be faster for the implementation to use one approach when processing e.g. `double1 = double2+longlong1;` but a different approach when processing e.g. `double1 = 0x1234567812345678 + longlong1;`. – supercat Mar 08 '23 at 23:23
  • @supercat standard wording was "it is an implementation-defined choice of either the next lower or higher representable value" - I've tweaked cppreference to match closer – Cubbi Mar 21 '23 at 14:00
  • @Cubbi: Seems like an annoying bit of over-specification by the Standard, in a place where it would be better to offer a recommendation and a macro to indicate what level of precision is guaranteed. On some platforms, code which merely has to guarantee a result that is less than 1 ulp from being correct could be considerably faster than code which has to guarantee finer accuracy than that, and for many tasks improved performance would be very desirable but sub-ulp accuracy would offer no benefit whatsoever. – supercat Mar 21 '23 at 14:46