There are two questions here: what's the right answer, and why is R getting a different answer under different circumstances?
That number, 3.999999999999999777999999999999999999999, is obviously very close to 4. In fact, it's closer to 4 than it is to any other IEEE-754 double-precision floating-point number: the next-lower representable value is about 3.9999999999999995, which is a little bit farther away. So, strictly speaking, `trunc(3.999999999999999777999999999999999999999)` should be the same as `trunc(4.0)`, which is obviously 4. That is, when R takes the input 3.999999999999999777999999999999999999999, it should immediately convert it to an internal value of exactly 4, even before trying to truncate it. This looks "wrong", because you and I can plainly see that truncating 3.999… should give 3, but the fact that not every real number is representable in a finite-precision floating-point format does occasionally lead to anomalies like this. (See also Stack Overflow's canonical questions on binary floating-point anomalies of this sort.)
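If you want to see this at the C level, here is a minimal check. It assumes IEEE-754 doubles and a correctly-rounded `strtod` (glibc's, for example, is correctly rounded); it parses the literal, compares it against 4.0, and prints the nearest representable double below 4:

```c
/* Sketch: parse the literal and inspect the nearest doubles.
 * Assumes IEEE-754 binary64 doubles and a correctly-rounded strtod
 * (true of, e.g., glibc). */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(void)
{
    double x = strtod("3.999999999999999777999999999999999999999", NULL);
    double below = nextafter(4.0, 0.0);   /* largest double below 4 */

    printf("parsed value      = %.17g\n", x);
    printf("next double below = %.17g\n", below);
    printf("x == 4.0?           %s\n", x == 4.0 ? "yes" : "no");
    printf("trunc(x)          = %g\n", trunc(x));
    return 0;
}
```

On a system with a correctly-rounded converter, the parsed value and its truncation both come out as exactly 4, because the literal rounds up to 4.0 before `trunc` ever sees it.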
For the rest of this answer, we leave the realm of using R and delve into the world of implementing R. (I'm a C programmer, not an R user, and this answer is likely to betray that bias by being ignorant of any nuances of R. Apologies for that.) At any rate, R is written in C, and uses C's `double` type for much of its arithmetic. On the vast majority of popular general-purpose computers today, a C `double` is implemented using IEEE-754 double precision, which is why I led off this answer by mentioning that standard.
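If you want to confirm what `double` means on your own machine, the macros in `<float.h>` will tell you; on an IEEE-754 system a `double` carries 53 significand bits, which is only about 15 to 17 decimal digits, nowhere near the 40-odd digits of the literal above:

```c
/* Sketch: report the precision of this platform's double.
 * On an IEEE-754 machine, expect radix 2 and a 53-bit significand. */
#include <stdio.h>
#include <float.h>

int main(void)
{
    printf("FLT_RADIX    = %d\n", FLT_RADIX);
    printf("DBL_MANT_DIG = %d significand bits\n", DBL_MANT_DIG);
    printf("DBL_DIG      = %d reliable decimal digits\n", DBL_DIG);
    return 0;
}
```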
But why is R getting a different answer depending on how many trailing 0's there are? The answer lies in a function deep down in the R interpreter which is doing the actual conversion of characters typed by the user into internal R data structures.
How might we convert a string like "123.456" into its internal floating-point representation? One way to do it is to temporarily ignore the decimal point and convert the digits to an integer, resulting in the number 123456, then count the number of digits after the decimal point, and divide by ten to that power. And, indeed, 123456 ÷ 10³ is 123.456.
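Here is a deliberately naive sketch of that strategy in C. It is not R's actual code (R's real converter handles signs, exponents, and error checking, among other things); it just illustrates the ignore-the-decimal-point-then-divide idea:

```c
/* A deliberately naive sketch of the "ignore the decimal point, then
 * divide by a power of ten" strategy described above.  Not R's actual
 * code; no sign, exponent, or error handling. */
#include <stdio.h>
#include <math.h>
#include <ctype.h>

static double naive_strtod(const char *s)
{
    double mantissa = 0.0;     /* the digits, with the decimal point ignored */
    int frac_digits = 0;       /* how many digits came after the point       */
    int seen_point = 0;

    for (; *s != '\0'; s++) {
        if (*s == '.') {
            seen_point = 1;
        } else if (isdigit((unsigned char)*s)) {
            mantissa = mantissa * 10.0 + (*s - '0');
            if (seen_point)
                frac_digits++;
        }
    }
    return mantissa / pow(10.0, frac_digits);
}

int main(void)
{
    printf("%.17g\n", naive_strtod("123.456"));   /* 123.456, more or less */
    return 0;
}
```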
But using that strategy, converting 3.99999999999999977799999999999999999999900 is going to involve taking a 42-digit number and dividing it by 10⁴¹, while converting 3.999999999999999777999999999999999999999000 is going to involve taking a 43-digit number and dividing it by 10⁴².
And none of these numbers are going to be exactly representable in binary floating point. They're going to be a little bit off, and that's sometimes going to lead to discrepancies. And in particular, when the numbers are this big, there's no guarantee that a ÷ b will give you exactly the same answer as 10a ÷ 10b.
For the current example, the discrepancy is that one division leads to a number that's closer to 4, and one division leads to a number that's closer to 3.9999999999999995. (And, remember, I'm talking about divisions that are happening in C code deep in the R interpreter, not any divisions you thought you were doing in R.)
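Here is a small experiment in that spirit. It accumulates the mantissa digits in a `long double` the way a converter might, then appends varying numbers of mathematically meaningless trailing zeros (each one multiplies the mantissa by 10 and bumps the power of ten by one) before the final division. It is only a sketch of the idea, not R's code, and whether the printed results actually differ depends on your platform's `long double` and math library:

```c
/* Sketch of the failure mode described above: same value, different
 * number of trailing zeros, so a different (mantissa, power-of-ten)
 * pair gets divided.  Not R's actual code; results are platform-
 * dependent and may or may not show a discrepancy on your machine. */
#include <stdio.h>
#include <math.h>

static double convert(const char *s, int extra_zeros)
{
    long double mantissa = 0.0L;
    int frac_digits = 0, seen_point = 0;

    for (; *s != '\0'; s++) {
        if (*s == '.') { seen_point = 1; continue; }
        mantissa = mantissa * 10.0L + (*s - '0');
        if (seen_point)
            frac_digits++;
    }
    while (extra_zeros-- > 0) {   /* appending "0" digits: mathematically a no-op */
        mantissa *= 10.0L;
        frac_digits++;
    }
    return (double)(mantissa / powl(10.0L, frac_digits));
}

int main(void)
{
    const char *s = "3.999999999999999777999999999999999999999";

    printf("as written        : %.17g\n", convert(s, 0));
    printf("two extra zeros   : %.17g\n", convert(s, 2));
    printf("three extra zeros : %.17g\n", convert(s, 3));
    return 0;
}
```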
There are several additional factors involved here. (In particular, R is using "binary exponentiation" to compute 10ᴺ, and that ends up making a difference, too.) I don't have time to write those details up just now; maybe later. Interested readers can consult the file `src/main/util.c` in the R source distribution, specifically the function `R_strtod5`.
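For the curious, "binary exponentiation" (also known as exponentiation by squaring) looks something like the following. This is a generic sketch of the technique, not a copy of the code in `R_strtod5`; presumably the relevant point is that it reaches 10ᴺ through a different sequence of intermediate products, and therefore a different sequence of roundings, than multiplying by 10 N times would.

```c
/* Generic sketch of binary exponentiation (exponentiation by squaring)
 * for powers of ten: square the base and halve the exponent each step.
 * Not R's code; just an illustration of the technique. */
#include <stdio.h>

static long double pow10_binary(int n)
{
    long double result = 1.0L, base = 10.0L;

    while (n > 0) {
        if (n & 1)              /* if the low bit of n is set...  */
            result *= base;     /* ...fold in the current power   */
        base *= base;           /* 10^1, 10^2, 10^4, 10^8, ...    */
        n >>= 1;
    }
    return result;
}

int main(void)
{
    printf("10^41 = %.21Lg\n", pow10_binary(41));
    return 0;
}
```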
But the take-home lesson is that accurately converting back and forth between binary floating-point numbers and human-readable base-10 representations is hard. Among other things, getting properly-rounded results usually requires doing your calculations in some higher-precision representation, so that you'll have something just barely accurate enough to round off at the end (that is, to yield the desired "properly-rounded result"). Ironically, R's implementation is trying to do the right thing in that respect, computing both numbers (that is, the two numbers to be divided) using C's `long double` type. I would have thought that would have been sufficient to avoid anomalies like this, but evidently not.
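How much extra precision `long double` actually buys varies a lot from platform to platform: 64 significand bits with x86 extended precision, 113 with IEEE quadruple precision, or just 53 (no more than plain `double`) on some compilers and architectures. A quick way to check what you're getting:

```c
/* Sketch: compare the precision of double and long double on this
 * platform.  53 vs. 64 is typical on x86; on some systems the two
 * types are identical. */
#include <stdio.h>
#include <float.h>

int main(void)
{
    printf("double      : %d significand bits\n", DBL_MANT_DIG);
    printf("long double : %d significand bits\n", LDBL_MANT_DIG);
    return 0;
}
```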
Also, this would be worth reporting as a bug in R. A truly high-quality `strtod` implementation won't have anomalies like this, and having gone the route of implementing its own, R is (I would say) on the hook to reinvent whatever wheels are necessary in order to get a properly-rounded result in all cases.
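For comparison, here is a quick check you can run against your own C library's `strtod` (glibc's, for instance, is correctly rounded). Hand it the two spellings from the question, which denote the same real number; a correctly-rounded converter must return the identical `double` both times:

```c
/* Sketch: feed two spellings of the same value (differing only in
 * trailing zeros) to the platform's strtod.  A correctly-rounded
 * converter returns identical doubles for both. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *a = "3.99999999999999977799999999999999999999900";
    const char *b = "3.999999999999999777999999999999999999999000";
    double da = strtod(a, NULL);
    double db = strtod(b, NULL);

    printf("%.17g\n%.17g\n", da, db);
    puts(da == db ? "same double both times" : "different doubles!");
    return 0;
}
```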