The problem you are dealing with appears to be mainly a rounding issue and not really a numerical-stability issue. Indeed, when a floating-point operation is performed, the result has to be rounded so that it fits in the standard floating-point representation. The IEEE-754 standard specifies multiple rounding modes; the default one is typically round-to-nearest.
This means (b - a.T @ x) / (a.T @ d) and a.T @ (x + s * d) can be rounded to the previous or the next floating-point value. As a result, a slight imprecision is introduced in the computation. This imprecision is typically within 1 unit of least precision (ULP). For double-precision numbers, 1 ULP corresponds to a relative error of about 2.2e-16, and round-to-nearest guarantees an error of at most half of that, i.e. roughly 1.1e-16 per operation.
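For reference, the size of 1 ULP around 1.0 can be checked directly with Numpy:

```python
import numpy as np

# Gap between 1.0 and the next representable double: this is 1 ULP at 1.0,
# equal to the machine epsilon 2**-52 (~2.22e-16).
print(np.spacing(1.0))           # 2.220446049250313e-16
print(np.finfo(np.float64).eps)  # same value
```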
In practice, every operation can introduce a rounding, not just the whole expression, so the total error is typically a few ULPs. For operations like additions of same-sign values, the rounding errors tend to partially cancel out, while for others, like the subtraction of nearly equal values, the relative error can increase dramatically. In your case, the error seems to be due only to the accumulation of small errors in each operation.
The floating-point units of processors can be controlled in low-level languages, and Numpy also provides a way to find the next/previous floating-point value (np.nextafter). Based on this, you can round the value up or down for some parts of the expression so that s ends up smaller than the theoretical target value. That being said, this is not so easy, since some of the computed values can certainly be negative, which reverses the direction of the required rounding. One can round positive and negative values differently, but the resulting code will certainly not be efficient in the end.
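Here is a minimal sketch of this idea using np.nextafter; the arrays a, x, d and the bound b are placeholders standing in for your actual inputs, and it assumes s has to be rounded downward:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder data standing in for the question's inputs (assumption).
a = rng.normal(size=8)
d = rng.normal(size=8)
x = rng.normal(size=8)
b = a @ x + 1.0  # so that the constraint a @ x <= b holds initially

s = (b - a.T @ x) / (a.T @ d)

# Move s to the previous representable float so that a @ (x + s * d) is less
# likely to overshoot b. As noted above, if the relevant terms are negative,
# the nudge may have to go in the opposite direction instead.
s_down = np.nextafter(s, -np.inf)
```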
An alternative solution is to compute a theoretical error bound and subtract it from s. That being said, this error depends on the computed values and on the actual algorithm used for the summation (e.g. naive sum, pair-wise, Kahan, etc.). For example, the naive algorithm and the pair-wise one (used by Numpy) are sensitive to the standard deviation of the input values: the higher the standard deviation, the bigger the resulting error. This solution only works if you know the distribution of the input values and/or their bounds exactly. Another issue is that such a bound tends to over-estimate the actual error, while an average-case estimate is only an estimate and not a guarantee.
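A crude sketch of this idea follows, reusing the placeholder a and s from the previous snippet. The bound is an arbitrary, pessimistic guess based on the machine epsilon and the number of accumulated terms; the factor 4 is a safety margin chosen for illustration, not a derived bound:

```python
import numpy as np

eps = np.finfo(np.float64).eps   # ~2.22e-16, relative error scale of float64
n = a.size                       # number of terms accumulated in the dot products
bound = 4 * n * eps * abs(s)     # pessimistic, hand-wavy error bound (assumption)
s_safe = s - bound               # shift s down so the requirement should hold
```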
Another alternative method is to rewrite the expression by replacing s with s + h or s * h and to find the value of h based on the already computed s and the other parameters. This method is a bit like a predictor-corrector scheme. Note that h may itself not be precise, also due to floating-point errors.
With the absolute correction method we get:
```python
h_abs = (b - a @ (x + s * d)) / (a @ d)
s += h_abs
```
With the relative correction method we get:
```python
h_rel = (b - a @ x) / (a @ (s * d))
s *= h_rel
```
Here are the differences obtained with the initial method and the two correction methods:
```
Initial method: 8.8817842e-16 (8 ULP)
Absolute method: -8.8817842e-16 (8 ULP)
Relative method: -8.8817842e-16 (8 ULP)
```
I am not sure either of the two methods is guaranteed to fulfil the requirement, but a more robust approach could be to select the smaller of the two resulting s values. At least the results are quite encouraging, since the requirement is fulfilled by both methods with the provided inputs.
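For example (a sketch reusing the placeholder inputs from above, and assuming that a smaller s is the safe direction for your constraint):

```python
# Apply both corrections to the initially computed s and keep the smaller result.
s0 = (b - a @ x) / (a @ d)
s_abs = s0 + (b - a @ (x + s0 * d)) / (a @ d)  # absolute correction
s_rel = s0 * ((b - a @ x) / (a @ (s0 * d)))    # relative correction
s_robust = min(s_abs, s_rel)
```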
A good way to generate more precise results is to use the Decimal class of the standard decimal module, which provides arbitrary precision at the expense of a much slower execution. This is particularly useful to compare the practical results against more precise reference values.
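For example, a higher-precision reference value of s can be computed like this (a sketch reusing the placeholder inputs; prec = 50 is an arbitrary choice):

```python
from decimal import Decimal, getcontext

getcontext().prec = 50  # work with 50 significant digits

# Convert the float64 inputs exactly to Decimal and redo the computation.
aD = [Decimal(float(v)) for v in a]
xD = [Decimal(float(v)) for v in x]
dD = [Decimal(float(v)) for v in d]
bD = Decimal(float(b))

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

s_ref = (bD - dot(aD, xD)) / dot(aD, dD)
print(s_ref)  # compare with the float64 value of s
```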
Finally, a last solution is to increase/decrease s one ULP at a time so as to find the best result. Depending on the actual summation algorithm and on the inputs, the results can change, and the exact expression used to compute the difference also matters. Moreover, the result is certainly not monotonic because of the way floating-point arithmetic behaves, which means one may need to move s by many ULPs to perform this optimization. This solution is not very efficient (at least, unless big steps are used).
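A sketch of this brute-force approach (again reusing the placeholder inputs, assuming that decreasing s is the right direction, and capping the number of ULP steps):

```python
import numpy as np

s_fixed = (b - a @ x) / (a @ d)
for _ in range(64):                           # safety cap on the number of ULP steps
    if a @ (x + s_fixed * d) <= b:            # requirement met: stop
        break
    s_fixed = np.nextafter(s_fixed, -np.inf)  # step down by one ULP
```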