In C and Objective-C, what really is the right way to truncate a float or double to an integer?

Question

I worked mostly with integers before, and in situations where I need to truncate a float or double to an integer, I would use the following before:

(int) someValue

except until I found out the following:

NSLog(@"%i", (int) ((1.2 - 1) * 10));     // prints 1
NSLog(@"%i", (int) ((1.2f - 1) * 10));    // prints 2

(please see Strange behavior when casting a float to int in C# for the explanation).

The short question is: how should we truncate a float or double to an integer properly? (Truncation is wanted in this case, not "rounding"). Or, we may say that since one number is 1.9999999999999 and the other is 2.00000000000001 (roughly speaking), the truncate is actually done correctly. So the question is, how should we convert a float or double so that the result is a "truncated" number that makes common usage sense?

(the intention is not to use round, because in this case, for 1.8, we do want the result of 1, instead of 2)

Longer question:

I used

int truncateToInteger(double a) {
    return (int) (a + 0.000000000001);
}

-(void) someTest {
    NSLog(@"%i", truncateToInteger((1.2 - 1) * 10));
    NSLog(@"%i", truncateToInteger((1.2f - 1) * 10));
}

and both print out as 2, but it seems too much of a hack, and what small number should we use to "remove the inaccuracy"? Is there a more standard or studied way, instead of such an arbitrary hack?

(Note that we want truncation, not rounding in some usage, for example, say, if the number of seconds is 90 or 118, when we show how many minutes and how many seconds have elapsed, the minute should display as 1, but should not be rounded up to 2)

yes, floor has the same issue (first using floor, and then cast to int) — nonopolarity, Jun 28 '12 at 12:38

Steve Jessop · Answer 1 · 2012-06-29T09:42:45.420

12

The truncate has been performed correctly, of course, but on an inaccurate intermediate value.

In general there's no way to know whether your 1.999999 result is a slightly inaccurate 2 (so the exact-maths result after truncation is 2), or a slightly inaccurate 1.999998 (so the exact-maths result after truncation is 1).

For that matter, for some calculations you could get 2.000001 as a slightly inaccurate 1.999998. Pretty much whatever you do, you'll get that one wrong. Truncation is a non-continuous function, so however you do it, it makes your overall computation numerically unstable.

You could add an arbitrary tolerance anyway: (int)(x > 0 ? x + epsilon : x - epsilon). It may or my not help, depending what you're doing, which is why it's a "hack". epsilon could be a constant, or it could scale according to the size of x.

The most common solution to your second question isn't to "remove the inaccuracy", rather to accept the inaccurate result as if it were accurate. So, if your floating point unit says that (1.2-1)*10 is 1.999999, OK, it is 1.999999. If that value represents a number of minutes then it truncates to 1 minute 59 seconds. Your final displayed result will be 1s off the true value. If you need a more accurate final displayed result than that, then you shouldn't have used floating-point arithmetic to compute it, or perhaps you should have rounded to the nearest second before truncating to minutes.

Any attempt to "remove" inaccuracy from a floating-point number is actually just going to move inaccuracy around - some inputs will give more accurate results, others less accurate. If you're lucky enough to be in a case where the the inaccuracy is shifted to inputs you don't care about, or can filter out before doing the computation, then you win. In general though, if you have to accept any input then you're going to lose somewhere. You need to look at how to make your computation more accurate, rather than trying to remove inaccuracy in a truncation step at the end.

There's a simple correction for your example computation -- use fixed-point arithmetic with one base-10 decimal place. We know that format can accurately represent 1.2. So, instead of writing (1.2 - 1) * 10, you should rescale the computation to use tenths (write (12 - 10) * 10) and then divide the final result by 10 to scale it back to units.

edited Jun 29 '12 at 09:42

answered Jun 28 '12 at 12:44

Steve Jessop

273,490
39
460
699

I +1'd your answer because it gets to the core of the issue. If OP really wants to treat numbers "very close" to the next integer as being the next integer up, something like multiplying by `(1+DBL_EPSILON*K)` before truncating might work, where `K` is a small constant roughly proportional to the number of roundings the original value was subjected to. – R.. GitHub STOP HELPING ICE Jun 28 '12 at 12:48
what should `epsilon` be? I think using `0.0000001` could be ok, but next time in another program, if I use `0.00000000000000000000000001` or some even smaller number, it may not be enough to "overcome the inaccuracy"... so what is a good way to handle it? – nonopolarity Jun 28 '12 at 12:53
@動靜能量: Like R.. says, it depends on the computation you did. – Steve Jessop Jun 28 '12 at 12:54
I get the idea... but if we need to write `truncateToInteger`, the answer won't be "it depends", right? – nonopolarity Jun 28 '12 at 12:58
2

@動靜能量: if you need to write `int truncateToIntegerAccountingForErrorsInAnUnknownPreviousComputation(double)`, then you have a problem similar to the problem you have when you need to write `justReadTheUsersFreakingMind`. If there was one universal "best" way to do it, then that's what `(int)` would be defined to do. There isn't. – Steve Jessop Jun 28 '12 at 13:00
I think it is more like "accounting for float and double inaccuracy" and it is not as far fetched – nonopolarity Jun 28 '12 at 13:04
Like I said, epsilon should be based on `DBL_EPSILON` (or `FLT_EPSILON` if you're using floats instead of doubles). – R.. GitHub STOP HELPING ICE Jun 28 '12 at 13:32
@R so `DBL_EPSILON` is in fact some standard... not just an arbitrary constant we define... – nonopolarity Jun 28 '12 at 13:39
so specifically, if the calculation is `((1.2 - 1) * 10)`, and now it should be truncated, then what value should it be multiplied before the truncation? That is, for `(1+DBL_EPSILON*K)`, what should `K` be? – nonopolarity Jun 29 '12 at 01:03
Well, `x` could in principle be `x*DBL_EPSILON` too small (in fact it isn't for `x = 1.2`). Subtracting a power of 2 (`1 = 2^0`) of magnitude within the digits represented in the significand of `x` doesn't introduce any additional error. So I think you should allow for the final result to be about `12*DBL_EPSILON` too small. Which means `K` would be 6, since you're going to multiply a number close to `2` by `1 + K*EPSILON`. Or maybe `7`, to account for the inaccuracies in my rough calculation of the inaccuracies. – Steve Jessop Jun 29 '12 at 08:35
+1 to Steve Jessop, you can't undo the floating point rounding "error" that easily because once you losed information, it's too late, and writing "float x = 1.2f;" or "double y = 1.2;" is already too late. If you knew the exact sequence of inexact operations that lead to the approximate result, then you might recover the error like in Shewchuck http://www-2.cs.cmu.edu/afs/cs/project/quake/public/papers/robust-arithmetic.ps , but then I'd say OP must question why using floating point at all? – aka.nice Aug 13 '12 at 14:28

Eric Postpischil · Answer 2 · 2012-07-02T11:11:52.583

As you have modified your question, the problem now seems to be this: Given some inputs x, you calculate a value f'(x). f'(x) is the calculated approximation to an exact mathematical function f(x). You want to calculate trunc(f(x)), that is, the integer i that is farthest from zero without being farther from zero than f(x) is. Because f'(x) has some error, trunc(f'(x)) might not equal trunc(f(x)), such as when f(x) is 2 but f'(x) is 0x1.fffffffffffffp0. Given f'(x), how can you calculate trunc(f(x))?

This problem is impossible to solve. There is no solution that will work for all f.

The reason there is no solution is that, due to the error in f', f'(x) might be 0x1.fffffffffffffp0 because f(x) is 0x1.fffffffffffffp0, or f'(x) might be 0x1.fffffffffffffp0 because of calculation errors even though f(x) is 2. Therefore, given a particular value of f'(x), it is impossible to know what trunc(f(x)) is.

A solution is possible only given detailed information about f (and the actual operations used to approximate it with f'). You have not given that information, so your question cannot be answered.

Here is a hypothesis: Suppose the nature of f(x) is such that its results are always a non-negative multiple of q, for some q that divides 1. For example, q might be .01 (hundredths of a coordinate value) or 1/60 (represent units of seconds because f is in units of minutes). And suppose the values and operations used in calculating f' are such that the error in f' is always less than q/2.

In this very limited, and hypothetical, case, then trunc(f(x)) can be calculated by calculating trunc(f'(x)+q/2). Proof: Let i = trunc(f(x)). Suppose i > 0. Then i <= f(x) < i+1, so i <= f(x) <= i+1-q (because f(x) is quantized by q). Then i-q/2 < f'(x) < i+1-q+q/2 (because f'(x) is within q/2 of f(x)). Then i < f'(x)+q/2 < i+1. Then trunc(f'(x)+q/2) = i, so we have the desired result. In the case where i = 0, then -1 < f(x) < 1, so -1+q <= f(x) <= 1-q, so -1+q-q/2 < f'(x) < 1-q+q/2, so -1+q < f'(x)+q/2 < 1, so trunc(f'(x)+q/2) = 0.

(Note: If q/2 is not exactly representable in the floating-point precision used or cannot be easily added to f'(x) without error, then some adjustments have to be made in either the proof, its conditions, or the addition of q/2.)

If that case does not serve your purpose, then you cannot expect an answer expect by providing detailed information about f and the operations and values used to calculate f'.

score 1 · Answer 3 · answered Jun 28 '12 at 12:43

1

The 'hack' is the proper way to do it. It's simple how floats work, if you want more sane decimal behavior NSDecimal(Number) might be what you want.

answered Jun 28 '12 at 12:43

Hampus Nilsson

6,692
1
25
29

Is there NSDecimal? I can only find NSDecimalNumber https://developer.apple.com/library/mac/#documentation/Cocoa/Reference/Foundation/Classes/NSDecimalNumber_Class/Reference/Reference.html – nhahtdh Jun 28 '12 at 12:46
1

NSDecimal is the struct version, it's documented here: https://developer.apple.com/library/mac/#documentation/Cocoa/Reference/Foundation/Miscellaneous/Foundation_DataTypes/Reference/reference.html – Hampus Nilsson Jun 28 '12 at 12:47
do they have a floor or truncate function? – nonopolarity Jun 28 '12 at 12:55
if the hack is actually what we want, then we would search for all the `(int)` in our code and change it to `truncateToInteger( value )`, it seems – nonopolarity Jun 28 '12 at 13:43
Yes, that is the common solution to the problem. – Hampus Nilsson Jun 28 '12 at 15:31
It's only a solution with decimal numbers. If you calculate 4/3rds = 1.333333, subtract 1, multiply by 12, decimal numbers don't help. – gnasher729 Feb 28 '14 at 23:54

inVader · Answer 4 · 2012-08-01T16:49:40.717

1

I would suggest that in general you should never expect your result to have higher precision than your input. So in your example, your float has one decimal place and you shouldn't need to take your result any more serious than that.

So how about rounding to one decimal place, then converting to int?

float a = (1.2f - 1) * 10;
int b;

// multiply by 10 to "round to one decimal place"
a = round( a * 10. );

// now cast to integer first to avoid further decimal errors
b = (int) a;

// get rid of the factor 10 again by integer division
b = b / 10;

// now 'b' should hold the result you're expecting;

edited Aug 01 '12 at 16:49

answered Jun 28 '12 at 14:03

inVader

1,504
14
25

round at a certain decimal point, and then truncate, that's an interesting thinking... (and hopefully, there won't be a case where a number is rounded to 2.0 or 1384729487.0, but truncated to something unexpected) – nonopolarity Jun 28 '12 at 14:35
By rounding to your initial precision you get rid of the errors at the small decimal places, no matter what they are. As you multiply by 10 in your initial statement you should always have a multiple of 10 after the rounding, so the integer division by 10 is safe in any way. But even if your expected result has decimals and should be 1.9 instead of 2 you will get b = 19/10 which will be one as you want it to. I do not see any problems lurking in the dark here. – inVader Jun 28 '12 at 16:37

Parag Bafna · Answer 5 · 2012-06-28T14:20:27.073

1

NSLog(@"%i", [[NSNumber numberWithFloat:((1.2 - 1) * 10)] intValue]); //2
NSLog(@"%i", [[NSNumber numberWithFloat:(((1.2f - 1) * 10))] intValue]); //2 
NSLog(@"%i", [[NSNumber numberWithFloat:1.8] intValue]); //1
NSLog(@"%i", [[NSNumber numberWithFloat:1.8f] intValue]); //1
NSLog(@"%i", [[NSNumber numberWithDouble:2.0000000000001 ] intValue]);//2

edited Jun 28 '12 at 14:20

answered Jun 28 '12 at 14:14

Parag Bafna

22,812
8
71
144

But it will break down when it is `NSLog(@"%i", [[NSNumber numberWithDouble:((1.2 - 1) * 10)] intValue]);` which prints `1` – nonopolarity Jun 28 '12 at 14:29
Take a look at http://www.koders.com/objectivec/fid92844506886795F3D6CA4417B418B12A1F7EB99B.aspx – Parag Bafna Jun 28 '12 at 14:30

Tammo Freese · Answer 6 · 2012-06-28T12:43:33.580

0

You have to calculate which errors you expect, and then it is safe to add that for your truncation. For example, you said 1.8 should be mapped to 1. What about 1.9? What about 1.99? If you know that in your problem domain you can't get anything bigger than 1.8, it is safe to add 0.001 for making the truncation work.

edited Jun 28 '12 at 12:43

answered Jun 28 '12 at 12:37

Tammo Freese

10,514
2
35
44

score 0 · Answer 7 · answered Jun 28 '12 at 13:13

The right way to do it is: Identify each floating-point operation you perform. This includes converting decimal numerals to floating-point (such as “1.2” in the source text producing the floating-point value 0x1.3333333333333p0 or “1.2f” producing 0x1.333334p0). Determine a limit on the error each operation can produce. (For elementary operations defined by IEEE 754, such as simple arithmetic, this limit is 1/2 ULP [unit of least precision] of the mathematically exact result of the actual input. For conversion from decimal numerals to binary floating-point, the language specification may allow 1 ULP, but good compilers will limit it to 1/2 ULP. For library routines providing complicated functions like sine or logarithm, commercial libraries typically allow several ULP of error, although they are often better within base intervals. You will need to get a specification from the library vendor.) Determine bounds on the final error, with a mathematical proof. If you are able to prove that, for some error bound e, when the exact mathematical result is some integer i, the actually calculated result is in the half-open interval [i-e, i+1-e), then you can produce the exact mathematical result by adding e to the calculated result and truncating the result of that calculation to an integer. (I have omitted certain complications for brevity. One is the problem that adding e might cause rounding up to i+1. Another is avoiding false positives, that is, avoiding producing i when the result is not i, possibly because the final error when the actual result is not i might put the calculated result in [i-e, i+1-e).)

As you can see, the “right” way is, in general, very difficult. For complicated code, proofs are produced only in limited high-value circumstances, such as designing high-quality library routines to calculate the standard math library functions (sine, logarithm, et cetera).

For simple code, a proof might be simple. If you know the answer should be exactly an integer, and you know you have not done so many floating-point operations that the error cannot have become as large as .5, then a right way to produce the correct answer is simply to add .5 and truncate. There is nothing wrong with this, because it is provably correct. (Actually, it is not just the number of operations that you perform but their nature. Subtraction of values with similar magnitudes is notorious for producing errors for which the relative error is great. Multiplying such a result by a large magnitude may produce a large absolute error.)

If you do not know the mathematically correct answer is exactly an integer, then truncating is wrong. If you do not know a bound on the error of your calculations, then adding any correction before truncating is wrong. There is no general answer to this problem; you must understand your computations.

thanks, but please note it is not to "round"... it is to purely truncate — nonopolarity, Jun 28 '12 at 13:18
actually, most calculations I do, the number represented in float or double is often less than 1000 or 2000 (the coordinates), and almost always less than a 32 bit integer value... — nonopolarity, Jun 28 '12 at 13:32
The type of "truncation" you're asking for **is not truncation**. It's a type of rounding. And it does not exist. — R.. GitHub STOP HELPING ICE, Jun 28 '12 at 13:33
The only mention of rounding in my answer is in mentioning the fact that adding a “correction” might cause undesired rounding. My answer does not advocate rounding. It discusses the production of a correct answer. To use floating-point correctly, you MUST understand it; you cannot apply a “one-size-fits-all” correction factor and expect it to work in all circumstances. — Eric Postpischil, Jun 28 '12 at 13:33
I am not asking for truncation? Some note is added to the original question, about, say if number of seconds is 110, then the "minute" should be displayed as "1", and that's truncation, isn't it? and I definitely don't want this "1" or "2" to be displayed as "0" or "1" instead — nonopolarity, Jun 28 '12 at 13:34

In C and Objective-C, what really is the right way to truncate a float or double to an integer?

7 Answers7