8

I am doing some floating point calculations and the results are not as accurate as I want them to be.

This is the algorithm:

...
center = (max_x + min_x) / 2
distance = old_x - center
new_x = center + (distance * factor)

return new_x

min_x, max_x, and old_x are all floats. I believe that the greatest error is introduced when I'm taking the average of the max and the min, and then the error is multiplied by the factor (which can be a float).

How can I minimize the error due to FP computation so that new_x is as precise as it can be?
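
One way to quantify how much error the floating-point version actually introduces, before deciding on a fix, is to rerun the same steps in exact rational arithmetic and compare. A minimal measurement sketch, assuming Python, with made-up sample values (Python floats are doubles, so this only approximates the single-precision case):

from fractions import Fraction

def new_x_float(min_x, max_x, old_x, factor):
    # The algorithm from the question, evaluated in ordinary floats
    center = (max_x + min_x) / 2
    distance = old_x - center
    return center + distance * factor

def new_x_exact(min_x, max_x, old_x, factor):
    # Fraction(float) converts each input exactly, so no step below rounds
    min_x, max_x, old_x, factor = map(Fraction, (min_x, max_x, old_x, factor))
    center = (max_x + min_x) / 2
    distance = old_x - center
    return center + distance * factor

args = (0.1, 123456.7, 123456.6, 1.5)  # arbitrary sample values
error = Fraction(new_x_float(*args)) - new_x_exact(*args)
print(float(error))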

Phil Miller
MxLDevs
  • Generally you should _add_ floats in order from smallest to largest (in absolute value), so you could expand the expression, sort and then sum. – Kerrek SB Jun 24 '11 at 17:24
  • Increasing the precision (e.g. from a `float` to a `double`, or something unbounded, etc.) for intermediate calculations can also minimize errors. –  Jun 24 '11 at 17:26
  • I bet this is not what you're looking for. But according to my experience the error should be really small... Have you tried replacing the 2 with a 2.0? – iolo Jun 24 '11 at 17:26
  • The error depends on the range of all the values. If you are referring to screen coordinates then the error would be minimal, as iolo said, though it still depends on the factor. If the coordinates are in floating-point 0-1 space then avoid much smaller or much bigger intermediate results. – Dileep Jun 24 '11 at 17:35
  • In a typical floating-point representation, dividing by 2 does not decrease precision at all, unless you are at denormalised values, very close to 0 (for `float`s it's around 1e-38). I don't think your problem comes from there. – CygnusX1 Jun 24 '11 at 18:53

5 Answers

4

If old_x and center are close then you're losing precision.

It's called Loss of significance

You could change the calculation so that the subtraction happens at the end:

center = (max_x + min_x) / 2
new_x = (center + (old_x * factor)) - (center * factor)
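
Whether this reordering helps for any particular inputs can be checked empirically: evaluate both forms in single precision and compare each against a double-precision reference. A rough sketch, assuming NumPy is available and that "float" here means IEEE single precision; the sample values are arbitrary:

import numpy as np

def original_form(min_x, max_x, old_x, factor):
    f = np.float32
    center = (f(max_x) + f(min_x)) / f(2)
    return center + (f(old_x) - center) * f(factor)

def reordered_form(min_x, max_x, old_x, factor):
    f = np.float32
    center = (f(max_x) + f(min_x)) / f(2)
    return (center + f(old_x) * f(factor)) - center * f(factor)

def reference(min_x, max_x, old_x, factor):
    # plain Python floats are doubles, used here as the higher-precision reference
    center = (max_x + min_x) / 2
    return center + (old_x - center) * factor

args = (0.1, 123456.7, 123456.6, 1.5)  # arbitrary sample values
ref = reference(*args)
print(abs(float(original_form(*args)) - ref))
print(abs(float(reordered_form(*args)) - ref))
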
Yochai Timmer
  • Can you tell us why is this better? – Karoly Horvath Jun 24 '11 at 17:36
  • The loss of precision was clear to me... I'm asking why your solution is any better? – Karoly Horvath Jun 24 '11 at 17:45
  • You could probably do some more manipulation with the division... have you tried it? – Yochai Timmer Jun 24 '11 at 17:45
  • It's supposed to be better because you're not multiplying the error by the factor. This way you're calculating both sides of the subtraction, and then doing it once... so you only have the subtraction error once. – Yochai Timmer Jun 24 '11 at 17:48
  • What if factor was less than 1? Then the first version would be more accurate? – Mikola Jun 24 '11 at 19:39
  • You would still have the loss of significance affecting one more operation. In your code the loss of significance affects the + operation and the * operation. In the code I posted the subtraction is last, so it affects only one operation. – Yochai Timmer Jun 24 '11 at 19:46
  • And btw, not necessarily, you already have an error after the subtraction, the digits you lost might have been important for the multiplication. So the fact that it's smaller than 1 doesn't mean the error will be smaller. – Yochai Timmer Jun 24 '11 at 19:58
  • This is still wrong, due to adding the two numbers in the first step. – Phil Miller Jun 24 '11 at 22:12
  • Depends... if both numbers are positive it shouldn't really matter. @Keikoku: min_x and max_x are positive? – Yochai Timmer Jun 25 '11 at 08:59
  • I doubt whether this is any better. I agree the cause is probably the loss of accuracy suffered when subtracting `center` from `old_x`. However, in the proposed version, the magnitude of the terms you subtract is `factor` larger so the loss of precision due to the subtraction will also be `factor` larger. In other words, it seems you gain nothing. – Jitse Niesen Jun 25 '11 at 12:24
  • The magnitude isn't important, the accuracy is; it's just a matter of how many calculations you do after you already have an error (how many other calculations are affected by the error). The multiplication is precise, so you only get the error after the subtraction. – Yochai Timmer Jun 25 '11 at 12:28
  • @Yochai Timmer, the min and max could be negative. – MxLDevs Jun 27 '11 at 18:00
  • @Yochai: I don't see why the magnitude is not important. The error commited when subtracting two numbers of magnitude M is approximately machine epsilon times M. – Jitse Niesen Jun 29 '11 at 12:32
  • You're losing significant digits here... for example 2.5555555558 - 2.5555555557 = 0.0000000001... so from a number that had 10 significant digits, you are left with only one. So if you multiply it by X it will be (1*10^-10)*X... The error here comes from the fact that you're only multiplying 1 digit by X; you no longer have the accuracy expected from a 10-digit number. The accuracy is lost, no matter what X is. – Yochai Timmer Jun 29 '11 at 16:05
2

Depending on your language, there is probably a fixed- or arbitrary-precision numeric type you can use, such as decimal in Python or BigDecimal in Java.
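
For example, in Python the decimal module lets you choose the working precision explicitly. A minimal sketch of that idea (the 50-digit precision and the conversion through str() are my choices, not requirements):

from decimal import Decimal, getcontext

getcontext().prec = 50  # work with 50 significant decimal digits

def new_x_decimal(min_x, max_x, old_x, factor):
    # Converting via str() keeps the intended decimal values rather than
    # the binary-float values that Decimal(float) would reproduce exactly
    min_x, max_x, old_x, factor = (Decimal(str(v)) for v in (min_x, max_x, old_x, factor))
    center = (max_x + min_x) / 2
    distance = old_x - center
    return center + distance * factor

print(new_x_decimal(0.1, 123456.7, 123456.6, 1.5))  # arbitrary sample values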

richq
limscoder
  • Or, perhaps, just a wider type, such as a `double` -- which has quite a bit more relative precision than a `float`. (C# has `decimal`, which, while still a relative floating point type, makes a `double` look like a tiny little inaccurate type.) Fixed precision numeric types often happen to be ... quite slow. –  Jun 24 '11 at 17:32
  • Fixed precision types are indeed slow, but you can specify exactly how you want them to work, and they will work that way cross-platform with no issues. This is really a language dependent question, because float type precision is cross-platform for some languages (JVM, CLR), but not others. – limscoder Jun 24 '11 at 17:47
1

None of the previous implementations use rounding and thus have a large error. Here's how to do this in fixed-point math. I'm using X.1u precision (1 LSB is used for the fraction part).

//center = (max_x + min_x) / 2
center = max_x + min_x // zero error here

// distance = old_x - center
distance = (old_x << 1) - center // zero error here

//new_x = center + (distance * factor)
new_x = (1 + center + (distance * factor)) >> 1

return new_x

If factor is fixed point too (an integer with N bits describing the fraction), then new_x can be calculated as:

new_x = ( (1 << N) + (center << N) + (distance * factor) ) >> (N + 1)
  • (center << N) has N+1 fraction bits
  • distance * factor has N+1 fraction bits
  • (1 << N) is a 'half' as 1 << (N+1) is 'one' in the above fixed point precision.

Once each part is understood, the above line can be compacted:

new_x = ( ((1 + center) << N) + (distance * factor) ) >> (N + 1)

The integer type used should be large enough, of course. If the valid range is unknown, one should check the inputs to this function and handle them accordingly. In most cases this isn't needed.

This is as good as it gets in fixed-point math. This is how HW circuits perform integer math operations.
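
For what it's worth, a rough Python translation of the N-fraction-bit formula above; the function name and the choice of N = 16 are mine, and Python's arbitrary-precision integers sidestep the "large enough integer type" concern:

def scale_about_center_fixed(min_x, max_x, old_x, factor, N=16):
    # min_x, max_x, old_x are integers; factor is quantized to N fraction bits
    factor_fp = round(factor * (1 << N))

    center = max_x + min_x            # center with 1 fraction bit, exact
    distance = (old_x << 1) - center  # distance with 1 fraction bit, exact

    # center << N and distance * factor_fp both have N + 1 fraction bits;
    # adding (1 << N) (i.e. one half) before the shift rounds to nearest
    return ((1 << N) + (center << N) + distance * factor_fp) >> (N + 1)

print(scale_about_center_fixed(0, 10, 7, 2.0))  # 9: center 5, distance 2, scaled by 2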

egur
1

This eliminates at least one source of error from your original algorithm:

# Adding min and max can produce a value of larger magnitude, losing some low-order bits
center = min_x + (max_x - min_x)/2
distance = old_x - center
new_x = center + (distance * factor)

return new_x

If you have more knowledge of the relationship between old_x, min_x and max_x, you can probably do better than this.
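
As an extreme illustration of the comment in the code above (assuming single-precision floats, shown here with NumPy), adding two large in-range values can overflow even though their midpoint is representable, while the min + (max - min)/2 form stays in range:

import numpy as np

a = np.float32(3.0e38)
b = np.float32(3.2e38)  # both representable in float32, but their sum is not

print((a + b) / np.float32(2))      # inf: a + b overflows (NumPy may print a warning)
print(a + (b - a) / np.float32(2))  # about 3.1e38: stays within range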

Phil Miller
1

As Yochai says, your problem is probably caused by the subtraction old_x - center. If old_x and center are close to each other then you lose precision.

The simple solution would be to do the computation using double instead of float, but I guess that's not possible. In that case, you need to get rid of the subtraction. One possibility is

distance_max = max_x - center
distance_min = min_x - center
distance = (distance_max + distance_min) / 2
new_x = center + factor * distance

This helps if max_x, min_x and center are quite far apart while the average of max_x and min_x is close to center. If that does not help, perhaps you can adapt the computation of max_x so that you actually compute max_x - center, but that needs changes in the part you did not show us.

Jitse Niesen
  • max_x is computed using max(x1, x2, x3, ..., xn). – MxLDevs Jun 27 '11 at 01:17
  • Don't take the average of the two distances that way. Add half their difference to the minimum of the pair, so that the low-order bits have a chance of making it through. – Phil Miller Jun 30 '11 at 03:46