
Using Python, assume I'm running through a known quantity of items `I`, and have the ability to time how long it takes to process each one (`t`), as well as keep a running total of time spent processing (`T`) and the number of items processed so far (`c`). I'm currently calculating the average on the fly as `A = T / c`, but this can be skewed by, say, a single item taking an extraordinarily long time to process (a few seconds compared to a few milliseconds).

I would like to show a running Standard Deviation. How can I do this without keeping a record of each t?

Josh K

3 Answers


As outlined in the Wikipedia article on the standard deviation (http://en.wikipedia.org/wiki/Standard_deviation#Rapid_calculation_methods), it is enough to keep track of the following three sums:

s0 = sum(1 for x in samples)
s1 = sum(x for x in samples)
s2 = sum(x*x for x in samples)

These sums are easily updated as new values arrive. The standard deviation can be calculated as

import math

std_dev = math.sqrt((s0 * s2 - s1 * s1) / (s0 * (s0 - 1)))

Note that this way of computing the standard deviation can be numerically ill-conditioned if your samples are floating point numbers and the standard deviation is small compared to the mean of the samples. If you expect samples of this type, you should resort to Welford's method (see the accepted answer).
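
For the question's use case, a minimal sketch that maintains the three sums as each timing arrives (the class name and the sample timings are illustrative, not from the question):

import math

class RunningStd:
    """Maintain s0, s1, s2 and derive a running sample standard deviation."""

    def __init__(self):
        self.s0, self.s1, self.s2 = 0, 0.0, 0.0

    def include(self, x):
        # Fold one new sample into the three running sums.
        self.s0 += 1
        self.s1 += x
        self.s2 += x * x

    @property
    def std(self):
        # Sample standard deviation; needs s0 >= 2 (see the comment
        # below about division by zero when s0 == 1).
        return math.sqrt((self.s0 * self.s2 - self.s1 * self.s1)
                         / (self.s0 * (self.s0 - 1)))

rs = RunningStd()
for t in [0.002, 0.003, 0.002, 4.5, 0.003]:  # made-up per-item timings in seconds
    rs.include(t)
print(rs.std)  # large, reflecting the 4.5 s outlier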

Sven Marnach
  • Can't `s0` be calculated more simply as `len(samples)`, and `s1` as `sum(samples)`? – Benjamin Apr 04 '11 at 20:20
  • @Benjamin: Of course. But the OP does not want to keep track of `samples`. I chose this syntax to make clear what will be added in each iteration (and for the nice symmetric look of it). – Sven Marnach Apr 04 '11 at 20:27
  • @Benjamin: Sven is showing programmatically that the standard deviation is defined as a function of the zeroth, first, and second moments of your data. – Seth Johnson Apr 04 '11 at 20:59
  • For one sample, `(s0 * (s0 - 1)) == 0`, so there's division by zero. – XTL May 11 '12 at 06:28

Based on Welford's algorithm:

import numpy as np

class OnlineVariance(object):
    """
    Welford's algorithm computes the sample variance incrementally.
    """

    def __init__(self, iterable=None, ddof=1):
        self.ddof, self.n, self.mean, self.M2 = ddof, 0, 0.0, 0.0
        if iterable is not None:
            for datum in iterable:
                self.include(datum)

    def include(self, datum):
        self.n += 1
        delta = datum - self.mean  # delta only needs to be a local variable
        self.mean += delta / self.n
        # M2 accumulates the sum of squared deviations from the running mean
        self.M2 += delta * (datum - self.mean)

    @property
    def variance(self):
        return self.M2 / (self.n - self.ddof)

    @property
    def std(self):
        return np.sqrt(self.variance)

Update the variance with each new piece of data:

N = 100
data = np.random.random(N)
ov = OnlineVariance(ddof=0)
for d in data:
    ov.include(d)
std = ov.std
print(std)

Check our result against the standard deviation computed by numpy:

assert np.allclose(std, data.std())
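
If samples arrive in batches, two such accumulators can also be combined without revisiting the data, using the pairwise-combination formula from Chan et al. A minimal sketch (the `merge` helper is illustrative, not part of the class above):

def merge(a, b):
    # Combine two OnlineVariance accumulators (Chan et al. pairwise formula).
    # Assumes both use the same ddof and that a.n + b.n >= 1.
    c = OnlineVariance(ddof=a.ddof)
    c.n = a.n + b.n
    delta = b.mean - a.mean
    c.mean = a.mean + delta * b.n / c.n               # weighted mean of the parts
    c.M2 = a.M2 + b.M2 + delta**2 * a.n * b.n / c.n   # within- plus between-batch spread
    return c

For example, merge(OnlineVariance(data[:50], ddof=0), OnlineVariance(data[50:], ddof=0)).std should agree with data.std() from the check above.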
unutbu
  • With default ddof that one crashes on the first round, for then n=1 and a division by zero happens with the variance. A fix is to make variance a property so that it is not computed online. – scellus Jul 12 '17 at 10:43
  • For the sake of being overly efficient, and because I'm interested in the math, is it possible to create an "include_multiple" function that can add many values in parallel (exploiting the new set's own mean/std/var)? I tried the approach listed [here](https://blog.birost.com/a?ID=01650-c4cb7a92-a186-4f5e-9ef6-e03164f1e8b2), but it seems to get the std wrong when calculated incrementally versus all at once using numpy (even with ddof=1). – Jackson H Jun 20 '22 at 03:55
  • Note about that site's implementation - it seems the integer division "//" when calculating incre_avg should be regular division "/", otherwise the mean is also incorrect. – Jackson H Jun 20 '22 at 03:57
  • Also, is it possible to add *and remove* values incrementally? For example, if I add 100 and want to remove its influence from the existing mean and std, is it possible without storing every value? – Jackson H Jun 22 '22 at 20:39

I use Welford's method, which gives more accurate results. John D. Cook's overview (https://www.johndcook.com/blog/standard_deviation/) explains it; here's a paragraph from it that summarizes why it is a preferred approach:

This better way of computing variance goes back to a 1962 paper by B. P. Welford and is presented in Donald Knuth’s Art of Computer Programming, Vol 2, page 232, 3rd edition. Although this solution has been known for decades, not enough people know about it. Most people are probably unaware that computing sample variance can be difficult until the first time they compute a standard deviation and get an exception for taking the square root of a negative number.
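
To see the failure mode Cook mentions, compare the two approaches on data with a large mean and a small spread; a small sketch with made-up numbers (the naive result here is dominated by cancellation and can even come out negative):

samples = [1e9 + x for x in (4.0, 7.0, 13.0, 16.0)]

# Naive sum-of-squares formula: s0*s2 and s1*s1 agree in their leading
# digits, so the subtraction cancels catastrophically.
s0 = len(samples)
s1 = sum(samples)
s2 = sum(x * x for x in samples)
naive_var = (s0 * s2 - s1 * s1) / (s0 * (s0 - 1))

# Welford's update on the same data stays well-conditioned.
n, mean, M2 = 0, 0.0, 0.0
for x in samples:
    n += 1
    delta = x - mean
    mean += delta / n
    M2 += delta * (x - mean)
welford_var = M2 / (n - 1)

print(naive_var)    # wrong here, and may be negative
print(welford_var)  # 30.0, the exact sample variance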

Alex Reynolds
  • Reading John Cook's overview, I did not see mention of how to incorporate weighted values. For the method in @SvenMarnach's post, Wikipedia (http://en.wikipedia.org/wiki/Standard_deviation#Rapid_calculation_methods) mentions that using weighted values for s0, s1, and s2 will produce the correct result. It seems likely, but do you know if the same holds for Welford's method? – Steven C. Howell May 22 '15 at 12:37
  • I'm also curious about weighting. The given solution would seem to give the variance for the entire history of the data stream. For some streaming domains I can see how it would be useful to have the value weighted to represent the variance of recent data points and have older samples die off. – Matt Dotson May 10 '18 at 17:51