2

I would like to generate numpy arrays with a decimal spacing, e.g. [0.05, 0.10, 0.15...]. The function np.linspace seems appropriate, as the examples include float arrays. However, I cannot understand the results I'm getting:

import numpy as np
x = np.linspace(2110.35, 2149.05, 775)
print(x[-1], type(x[-1]))
print(x[-3], type(x[-3]))

    2149.05 <class 'numpy.float64'>
    2148.9500000000003 <class 'numpy.float64'>

Most values are what I expect, but x[-3] is not. I thought that perhaps 2148.95 can't be stored any more accurately as a np.float64, but this doesn't seem to be true:

y = np.float64(2148.95)
print(y, type(y))

    2148.95 <class 'numpy.float64'>

This causes a problem when I want to use a comparison operator:

x[-3] <= y
    False

Why is this happening?

What can I do to safely obtain the sort of arrays I want?

(I apologize if I misused technical language, I don't know much about floating-point maths)

xxyxxyxyx1
  • 79
  • 7
  • Does this answer your question? [Is floating point math broken?](https://stackoverflow.com/questions/588004/is-floating-point-math-broken) – tomjn Nov 19 '20 at 19:55
  • No it does not. I understand that np.float64(2148.95) is not stored as the true decimal number 2148.95. I don't know why it appears as such in one case and not the other. I don't know why all values of x, except x[-3], print and compare as expected. – xxyxxyxyx1 Nov 19 '20 at 20:06
  • 1
    Equality test (including `<=`) on floats is not 'safe'. This is the float point issue. Look at the range for your values, `2149.05-2110.35`. I get `38.70000000000027` – hpaulj Nov 19 '20 at 21:09
  • @xxyxxyxyx1 -- There is one big reason, in addition to the usual suspect of "inherent limitation of fp-representation". Your input values to `linspace()` are wrong, in the first place, given your expectation of a spacing of `0.05`. Please see my edited answer for a complete set of reasons. – fountainhead Nov 21 '20 at 19:49

2 Answers2

2

Edit 2020-11-21: (Added section starting here)

Reason for discrepancy:

A peek at the source code for numpy.linspace reveals the following:

  1. First, the step is calculated from given start, stop, and num. There's only one way to do this, and there's nothing wrong or imprecise in how step is being computed.
  2. Successive output values are produced by taking the base value of start, and adding increasingly higher multiples of step, to the start value. So, for example, the first value is produced by adding (0 * step) to start, and the next value is produced by adding (1 * step) to start, and so on. Again, nothing wrong or unusual or imprecise in this. Remember that the primary promises of linspace() are to ensure uniformity of the spacing, and to ensure that exactly the requested number of samples are output.
  3. If endpoint=True in the call to linspace(), the last output value is forced to be equal to the specified stop value. Again, for most use-cases, this would seem the more desirable thing to do, rather than outputting a computed value as the last value, no matter how cleverly the last value is computed.

So, where is the imprecision coming from?

I can think of three reasons for the discrepancy being discussed here:

  1. Wrong expectation for the given input values of start, stop and num: Even in an ideal world where there are no inherent limitations in FP representation, it is impossible for x[-3] to equal 2148.95, when start=2110.35, stop=2149.05, and num=775. It may be possible only if these input values were slightly different, for example num=776 instead of num=775 -- in which case (stop - start)/(776-1) would be a nice round spacing of 0.05, which means it would then be reasonable to expect x[-3] to equal 2148.95.
  2. The start and start values themselves happen to be incapable of being precisely represented in FP. (Remember that most fractions can be represented in FP only with some imprecision. The notable exceptions are fractions that are negative powers of 2). So, with start and stop, what we see is not what the computer sees. This obviously affects the computation of step. The imprecision in this computed step is num times less than the imprecision in (stop - start). However, note that towards the end of the output sequence, this imprecision in (stop - start) will have an increasing impact, as increasingly higher multiples of step get added to start. Here's a demonstration of this:
    print (format(2110.35, ".40g"))
    print (format(2149.05, ".40g"))
    print (format(0.125, ".40g"))  # 0.125 is an FP-friendly vaue (it is 2**(-3))

outputs:

    2110.349999999999909050529822707176208496
    2149.050000000000181898940354585647583008
    0.125

  1. (As OP is already aware) Each computed output value (however precisely the computation is done), may or may not be representable precisely in FP. In practice, this inherent imprecision of FP will affect most of the generated output values.

Hypothetically speaking, it could be argued that perhaps numpy developers could have inserted extra interventional logic -- they could have tried to somehow figure out which of the output values can be precisely represented in FP, and for those output positions, they could output forced true values rather than computed values. Such an intervention is currently being done for the last output value, when endpoint=True. However, such excessive intervention would violate the primary promise of regularity of spacing, in addition to affecting the speed.

Another hypothetical debate could be on whether the successive values could be generated by cumulative addition, rather than by (start + i*step). That is, the second value could be computed as (first + step), the third could be computed as (second + step), and so on. I am not sure why such an implementation was not chosen, but I doubt that will get us any benefits in precision. The factors that I suspect to be causing the imprecision now, would still apply (the inherent limitation of FP representation for precisely representing the computed values, and the imprecision in start, stop and step). It will also probably be slower, since each value can be computed only after the previous one has been computed.

To summarize, rather than the linspace algorithm being sub-optimal, the real culpable reasons are -- wrong expectation (or wrong input), and inherent limitation of FP representation affects start, stop, and the computed values.

Edit 2020-11-21: (Added section ends here)

Your little test:

print(y, type(y))

producing the output:

2148.95 <class 'numpy.float64'>

does not necessarily prove that y contains exactly 2148.95.

That's because, when you call print(y), it internally invokes str(y), and we really don't know too much about the precision to which str(y) produces its output.

Suggested Actions:

  1. In the invocation of linspace(), change num=775 to num=776, to match your expectation of step=0.05
  2. To perform test such as if x[-3] <= y:, do something like if round(x[-3],5) <= round(y,5): instead (here, I'm assuming that you care about no more than 5 digits after the decimal point).
fountainhead
  • 3,584
  • 1
  • 8
  • 17
1

I had understood that all decimal values are approximated in floats. However, I expected np.linspace() to provide floats with the closest possible value to the true value. I now understand why this is not how this, nor most other functions behave.

The functions format() and np.nextafter() helped me understand:

x = np.linspace(2110.35, 2149.05, 775)
y = np.float64(2148.95)
print('y:')
print(format(y, '.20g'))
print('x[-3]:')
print(format(x[-3], '.20g'))
print('the next float after y:')
print(format(np.nextafter(y, y+1), '.20g'))

    y:
    2148.9499999999998181
    x[-3]:
    2148.9500000000002728
    next float above y:
    2148.9500000000002728


The result from linspace is not the closest float to the decimal value 2148.95. The floating-point maths done by np.linspace to generate the array introduce errors beyond the minimal inherent decimal-to-float64 error.

Thank you @fountainhead, round() solves my problem by providing the closest float to the true decimal:

print('x[-3] rounded:')
print(format(round(x[-3], 2), '.20g'))

    x[-3] rounded:
    2148.9499999999998181

In my case, where I always have two decimal places, I can continue to use <=. In other situations, I will test for a difference less than some epsilon.

xxyxxyxyx1
  • 79
  • 7