
This is a question about floating point analysis and numerical stability. Say I have two [d x 1] vectors a and x and a scalar b such that a.T @ x < b (where @ denotes a dot product).

I additionally have a unit [d x 1] vector d. I want to derive the maximum scalar s so that a.T @ (x + s * d) < b. Without floating point errors this is trivial:

s = (b - a.T @ x) / (a.T @ d).

But with floating point errors, this s is not guaranteed to satisfy a.T @ (x + s * d) < b.

Currently my solution is to use a stabilized division, which helps:

s = sign(a.T @ x) * sign(a.T @ d) * exp(log(abs(a.T @ x) + eps) - log(abs(a.T @ d) + eps)).

But this s still does not always satisfy the inequality. I can check by how much it fails:

diff = a.T @ (x + s * d) - b

And then "push" that diff back through: (x + s * d - a.T @ (diff + eps2)). Even with both the stable division and pushing the diff back sometimes the solution fails to satisfy the inequality. So these attempts at a solution are both hacky and they do not actually work. I think there is probably some way to do this that would work and be guaranteed to minimally satisfy the inequality under floating point imprecision, but I'm not sure what it is. The solution needs to be very efficient because this operation will be run trillions of times.

Edit: Here is an example in numpy of this issue coming into play, because a commenter had some trouble replicating this problem.

import numpy as np

np.random.seed(1)
p, n = 10, 1
k = 3
x = np.random.normal(size=(p, n))
d = np.random.normal(size=(p, n))
d /= np.sum(d, axis=0)
a, b = np.hstack([np.zeros(p - k), np.ones(k)]), 1
s = (b - a.T @ x) / (a.T @ d)

Running this code gives a case where a.T @ (s * d + x) > b, failing to satisfy the constraint. Instead we have:

>>> diff = a.T @ (x + s * d) - b
>>> diff
array([8.8817842e-16])

The question is about how to avoid this overshoot of the bound.

njwfish
  • The formula should be `s = (b - a.dot(x))/a.dot(d)`; there would be some extra steps to ensure a maximum if the denominator is negative. – Lutz Lehmann Oct 05 '22 at 18:37
  • Good catch, yeah, that's a typo; I left `b -` out in my definition of f. If `s` is negative then things are unbounded in direction `d`. This is not a problem. – njwfish Oct 06 '22 at 08:51
  • This is still wrong, `b - (ax)/(ad)` is different from `(b-ax)/(ad)`. To stay on the same side of the divide, you could replace `s` with `0.99*s` or use a factor closer to 1. – Lutz Lehmann Oct 06 '22 at 09:18
  • You're right, sorry! – njwfish Oct 06 '22 at 09:29
  • Do you have an example of input causing the issue? I tried with random [0;1] vectors of size 5 and got perfect results (though that is quite unusual). I think the error depends on the kind of input and `d` (as for most numerical problems). – Jérôme Richard Oct 08 '22 at 02:09
  • Updated with an example using numpy! This is also less of an issue with higher precision floats; I am interested in float32 in particular (though the example uses float64, which is numpy's default). – njwfish Oct 08 '22 at 12:35

1 Answer


The problem you are dealing with appears to be mainly a rounding issue rather than a numerical stability issue. Indeed, when a floating-point operation is performed, the result has to be rounded so as to fit in the standard floating-point representation. The IEEE-754 standard specifies multiple rounding modes; the default one is typically round-to-nearest.

This means (b - a.T @ x) / (a.T @ d) and a.T @ (x + s * d) can each be rounded to the previous or next representable floating-point value. As a result, a slight imprecision is introduced into the computation, typically of 1 unit in the last place (ULP). For double-precision numbers, 1 ULP corresponds to a relative spacing of about 2.2e-16, so round-to-nearest contributes at most about 1.1e-16 of relative error per operation.
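For reference, numpy exposes the size of one ULP directly, which is a quick way to see the scale of these errors (this snippet is just an illustration, not part of the correction methods below):

import numpy as np

print(np.finfo(np.float64).eps)     # 1 ULP at 1.0 for float64: about 2.22e-16
print(np.spacing(np.float64(1.0)))  # distance from 1.0 to the next float64 (same value)
print(np.finfo(np.float32).eps)     # about 1.19e-07 for float32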

In practice, each individual operation is rounded, not the expression as a whole, so the error is typically a few ULP. For operations like additions, the rounding errors tend to partially cancel, while for others, like the subtraction of nearly equal values, the relative error can dramatically increase. In your case, the error seems to be due only to the accumulation of small errors in each operation.

The floating-point units of processors can be configured to use directed rounding in low-level languages. Numpy also provides a way to find the next/previous representable floating-point value (np.nextafter). Based on this, you can round the value up or down in some parts of the expression so that s ends up smaller than the theoretical target value. That being said, this is not so easy, since some of the computed values can certainly be negative, which flips the required rounding direction. One can round positive and negative values differently, but the resulting code will certainly not be efficient in the end.
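As a rough sketch of that idea with np.nextafter (an illustration only, not a strict bound, since the dot products themselves are still rounded), one can bias the numerator down and the denominator up so the quotient under-estimates s; the directions assume a.T @ d > 0 and flip otherwise:

import numpy as np

def s_rounded_down(a, x, d, b):
    # Bias the last two operations so the final quotient is an under-estimate.
    num = np.nextafter(b - a.T @ x, -np.inf)   # nudge the numerator down one ULP
    den = np.nextafter(a.T @ d, np.inf)        # nudge the denominator up one ULP
    return num / den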

An alternative solution is to compute a theoretical error bound and subtract it from s. That being said, this error bound depends on the computed values and on the actual algorithm used for the summation (e.g. naive sum, pairwise, Kahan, etc.). For example, the naive and pairwise algorithms (the latter used by Numpy) are sensitive to the standard deviation of the input values: the higher the std-dev, the bigger the resulting error. This solution only works if you know the distribution of the input values and/or their bounds exactly. Another issue is that it tends to over-estimate the error bound and only gives an estimate of the average error.
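As an illustration only, a crude version of such a margin can be built from the machine epsilon; the operation count n_ops below is a hypothetical placeholder, and a tight bound would require a proper rounding-error analysis:

import numpy as np

def s_with_margin(a, x, d, b, n_ops=None):
    # Subtract a crude a priori error margin from the plain formula.
    if n_ops is None:
        n_ops = a.shape[0] + 2    # hypothetical: one term per addition plus a couple more ops
    eps = np.finfo(x.dtype).eps
    num = b - a.T @ x
    den = a.T @ d
    margin = n_ops * eps * (np.abs(b) + np.abs(a) @ np.abs(x)) / np.abs(den)
    return num / den - np.sign(den) * margin    # move s toward the feasible side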

Another alternative method is to rewrite the expression by replacing s with s + h or s * h and to solve for h based on the already-computed s and the other parameters. This method is a bit like a predictor-corrector scheme. Note that h may itself be imprecise due to floating-point errors.

With the absolute correction method we get:

h_abs = (b - a @ (x + s * d)) / (a @ d)
s += h_abs

With the relative correction method we get:

h_rel = (b - a @ x) / (a @ (s * d))
s *= h_rel

Here are the resulting differences a.T @ (x + s * d) - b for the initial formula and the two methods:

Initial method:   8.8817842e-16 (8 ULP)
Absolute method: -8.8817842e-16 (8 ULP)
Relative method: -8.8817842e-16 (8 ULP)

I am not sure either of the two methods is guaranteed to fulfil the requirement, but a robust approach could be to select the smaller of the two s values. At least, the results are quite encouraging, since the requirement is fulfilled for both methods with the provided inputs.
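A small sketch of that selection, reusing the two corrections above (it assumes the a, x, d, b variables from the question's example and that a.T @ d > 0, so the smaller s is the safer one):

import numpy as np

s0 = (b - a.T @ x) / (a.T @ d)                   # predictor: plain formula

s_abs = s0 + (b - a @ (x + s0 * d)) / (a @ d)    # absolute correction
s_rel = s0 * ((b - a @ x) / (a @ (s0 * d)))      # relative correction

s = np.minimum(s_abs, s_rel)                     # keep the more conservative candidate
diff = a.T @ (x + s * d) - b                     # check how far we are from the bound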

A good way to generate more precise reference results is to use the Decimal package, which provides arbitrary precision at the expense of much slower execution. This is particularly useful for comparing practical results with more precise ones.
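A small sketch of such a reference computation with the standard decimal module (the 50-digit precision is an arbitrary choice):

from decimal import Decimal, getcontext

getcontext().prec = 50   # arbitrary precision, far beyond float64

# High-precision reference value of s, converting the inputs element by element.
ax = sum(Decimal(float(ai)) * Decimal(float(xi)) for ai, xi in zip(a, x.ravel()))
ad = sum(Decimal(float(ai)) * Decimal(float(di)) for ai, di in zip(a, d.ravel()))
s_ref = (Decimal(b) - ax) / ad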

Finally, a last solution is to increase/decrease s one ULP at a time so as to find the best result. Depending on the actual summation algorithm and the inputs, the result can change. The exact expression used to compute the difference also matters. Moreover, the outcome is certainly not monotonic because of the way floating-point arithmetic behaves, which means one may need to move s by many ULP to perform this optimization. This solution is not very efficient (at least, unless big steps are used).
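A minimal sketch of that search with np.nextafter, assuming a.T @ d > 0 so that decreasing s decreases a.T @ (x + s * d); the iteration cap is an arbitrary safeguard:

import numpy as np

s = (b - a.T @ x) / (a.T @ d)
for _ in range(100):                    # arbitrary cap on the number of ULP steps
    if a.T @ (x + s * d) < b:           # constraint satisfied: stop
        break
    s = np.nextafter(s, -np.inf)        # step down to the previous representable float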

Jérôme Richard