
I am trying to calculate variances via np.std(array, ddof=0). The problem emerges when I happen to have a long constant array, i.e., all values in the array are the same. Instead of returning std = 0, it returns some small value, which in turn causes further estimation errors. The mean is returned correctly. Example:

np.std([0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1],ddof = 0)

gives 1.80411241502e-16

but

np.std([0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1],ddof = 0)

gives std = 0
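
For reference, the same two cases reproduce with compact list repetition (a minimal sketch):

import numpy as np

print(np.std([0.1] * 90, ddof=0))  # tiny nonzero value, ~1.8e-16
print(np.std([0.1] * 45, ddof=0))  # exactly 0.0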

Is there a way to overcome this, other than checking the data for uniqueness at every iteration and skipping the std calculation entirely?

Thanks

P.S. Following the marking of this question as a duplicate of Is floating point math broken?, I am copy-pasting the reply by @kxr on why it's a different question:

"The current duplicate marking is wrong. Its not just about simple float comparison, but about internal aggregation of small errors for near-zero outcome by using the np.std on long arrays - as the questioner indicated extra. Compare e.g. >>> np.std([0.1, 0.1, 0.1, 0.1, 0.1, 0.1]*200000) -> 2.0808632594793153e-12 . So he can e.g. solve by: >>> mean = a.mean(); xmean = round(mean, int(-log10(mean)+9)); std = np.sqrt(((a - xmean) ** 2).sum()/ a.size)"

The problem certainly starts with floating-point representation, but it does not stop there. @kxr - I appreciate the comment and the example.
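
For completeness, kxr's one-liner expands to the runnable sketch below (it assumes a positive mean, since log10 of the mean is taken):

import numpy as np
from math import log10

a = np.array([0.1] * 90)

# Round the mean to roughly nine significant digits; subtracting the
# rounded mean from the (identical) elements then gives exactly zero.
mean = a.mean()
xmean = round(mean, int(-log10(mean) + 9))
std = np.sqrt(((a - xmean) ** 2).sum() / a.size)
print(std)  # 0.0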

  • First rule of programming: [assume it is your own error first](http://blog.codinghorror.com/the-first-rule-of-programming-its-always-your-fault/). – Martijn Pieters Feb 23 '16 at 09:57
  • Good comment. And my error would be? Thanks – user3861925 Feb 23 '16 at 10:03
  • I can't parse that string. What should the std be? – Carlos Feb 23 '16 at 10:05
  • The current duplicate marking is wrong. It's not just about simple float comparison, but about the internal aggregation of small errors into a near-zero outcome when using np.std on **long** arrays - as the questioner explicitly indicated. Compare e.g. `>>> np.std([0.1, 0.1, 0.1, 0.1, 0.1, 0.1]*200000) -> 2.0808632594793153e-12`. So he can e.g. solve it by: `>>> mean = a.mean(); xmean = round(mean, int(-log10(mean)+9)); std = np.sqrt(((a - xmean) ** 2).sum()/ a.size)` – kxr Feb 23 '16 at 10:34
  • @Carlos - the std should be 0. The string is just a long list of repetitions of 0.1, or any other float/double of your choice. As kxr pointed out, repeating the float a large number of times does not solve the problem. – user3861925 Feb 23 '16 at 11:22

1 Answer


Welcome to the world of practical numerical algorithms! In real life, given two floating-point numbers x and y, checking x == y is rarely meaningful. Consequently, the question is not whether the standard deviation is exactly 0, but whether it is close to 0. Let's check, using np.isclose:

>>> import numpy as np
>>> np.isclose(1.80411241502e-16, 0)
True

Effectively, that's the best you can hope for. In a real-life situation, you can't even check whether all your items are the same, as you suggest. Are they floating-point numbers? Were they generated by some other process? They will carry small errors too.
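
In practice, then, one option is to snap near-zero results to zero (a minimal sketch; np.isclose's default tolerance, atol=1e-08, may need tuning to your data's scale):

import numpy as np

a = np.array([0.1] * 90)
std = np.std(a, ddof=0)

# Treat a standard deviation within floating-point noise of zero as zero.
if np.isclose(std, 0):
    std = 0.0
print(std)  # 0.0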

Ami Tavory
  • Good point. In the real world of the real problem that I am trying to solve, the question of comparison is legitimate and crucial, but as you pointed out, numerical algorithms are just numerics, and a near-zero sigma is effectively sigma 0. Thanks! – user3861925 Feb 23 '16 at 10:09