
I'm doing some statistics work. I have a (large) collection of random numbers to compute the mean of, and I'd like to work with generators: since I only need the mean, there's no reason to store all the numbers.

The problem is that numpy.mean breaks if you pass it a generator. I can write a simple function to do what I want, but I'm wondering if there's a proper, built-in way to do this?

It would be nice if I could say "sum(values)/len(values)", but len doesn't work for generators, and sum has already consumed values by then.

Here's an example:

import numpy 

def my_mean(values):
    n = 0
    Sum = 0.0
    try:
        while True:
            Sum += next(values)
            n += 1
    except StopIteration: pass
    return float(Sum)/n

X = [k for k in range(1,7)]
Y = (k for k in range(1,7))

print(numpy.mean(X))
print(my_mean(Y))

These both give the same, correct answer, but my_mean doesn't work for lists, and numpy.mean doesn't work for generators.

I really like the idea of working with generators, but details like this seem to spoil things.

– nick maxwell
  • You'd know how many random numbers your generator would produce, wouldn't you? – Sven Marnach Feb 10 '11 at 23:15
  • @Sven Marnach: suppose the generator is reading from a file? – Jimmy Feb 10 '11 at 23:22
  • If you really want to not store the data (and not implement your own slower `sum` function), you could create a counting generator and call it like this: `co = countingGen(); mean = sum(co(data))/co.getCount()` – Thomas Ahle Feb 10 '11 at 23:27
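
A minimal sketch of the counting-generator idea from the last comment; `countingGen` and `getCount` are hypothetical names taken from that comment, not from any library:

class countingGen:
    """Wrap an iterable and count items as they are consumed."""
    def __init__(self):
        self.count = 0

    def __call__(self, iterable):
        for item in iterable:
            self.count += 1
            yield item

    def getCount(self):
        return self.count

data = (k for k in range(1, 7))
co = countingGen()
print(sum(co(data)) / co.getCount())  # 3.5, without storing the numbers

Since Python evaluates the division left to right, sum() has fully consumed the generator before getCount() is read.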

10 Answers


In general, if you're doing a streaming mean calculation of floating point numbers, you're probably better off using a more numerically stable algorithm than simply summing the generator and dividing by the length.

The simplest of these (that I know) is usually credited to Knuth and is also known as Welford's online algorithm; it calculates the variance as well. Just the mean portion is copied here for completeness.

def mean(data):
    n = 0
    mean = 0.0

    for x in data:
        n += 1
        # running update: new_mean = old_mean + (x - old_mean)/n
        mean += (x - mean)/n

    if n < 1:
        return float('nan')
    else:
        return mean
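
For reference, here is the same recurrence extended with the variance part mentioned above; this is my own sketch of Welford's method, not code from the original source:

def mean_and_variance(data):
    # Single pass, numerically stable (Welford's method).
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in data:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # the second factor uses the *updated* mean
    if n < 1:
        return float('nan'), float('nan')
    if n < 2:
        return mean, float('nan')
    return mean, m2 / (n - 1)  # sample variance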

I know this question is super old, but it's still the first hit on Google, so it seemed appropriate to post. I'm still sad that the Python standard library doesn't contain this simple piece of code.

– Erik

Just one simple change to your code would let you use both. Generators are meant to be used interchangeably with lists in a for loop.

def my_mean(values):
    n = 0
    Sum = 0.0
    for v in values:
        Sum += v
        n += 1
    return Sum / n
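
For example, with the X and Y from the question, both calls now give the same answer:

X = [k for k in range(1, 7)]
Y = (k for k in range(1, 7))
print(my_mean(X))  # 3.5
print(my_mean(Y))  # 3.5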
– Mark Ransom
def my_mean(values):
    total = 0
    for n, v in enumerate(values, 1):
        total += v
    return total / n

print(my_mean(X))
print(my_mean(Y))

There is statistics.mean() in Python 3.4, but it calls list() on iterator input:

def mean(data):
    if iter(data) is data:
        data = list(data)
    n = len(data)
    if n < 1:
        raise StatisticsError('mean requires at least one data point')
    return _sum(data)/n

where _sum() returns an accurate sum (a math.fsum()-like function that, in addition to float, also supports Fraction and Decimal).
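
So on Python 3.4+ you can simply pass a generator in, at the cost of it being materialized internally; a quick illustration (my example, not part of the original answer):

import statistics

Y = (k for k in range(1, 7))
print(statistics.mean(Y))  # 3.5, but the generator was copied into a list first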

– jfs

The old-fashioned way to do it:

def my_mean(values):
    sum, n = 0, 0
    for x in values:
        sum += x
        n += 1
    return float(sum)/n
– Jimmy

You can use reduce without knowing the size of the array:

from functools import reduce  # Python 3: reduce is no longer a builtin
from itertools import count   # izip is just zip in Python 3

reduce(lambda c, i: (c*(i[1]-1) + float(i[0]))/i[1], zip(values, count(1)), 0)
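
Each step turns the running mean of the first i-1 items into the running mean of the first i items. For example, reusing the imports above (my usage sketch):

values = (k for k in range(1, 7))
mean = reduce(lambda c, i: (c*(i[1]-1) + float(i[0]))/i[1],
              zip(values, count(1)), 0)
print(mean)  # 3.5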
– topkara

One way would be

numpy.fromiter(Y, int).mean()

but this actually temporarily stores the numbers.
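
As an aside (my addition, not part of the original answer): if the length happens to be known up front, numpy.fromiter also accepts a count argument so the array can be preallocated, though the values are still stored in full:

import numpy

Y = (k for k in range(1, 7))
print(numpy.fromiter(Y, dtype=float, count=6).mean())  # 3.5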

– Sven Marnach

Your approach is a good one, but you should use the for x in y idiom instead of repeatedly calling next until you get a StopIteration. This works for both lists and generators:

def my_mean(values):
    n = 0
    Sum = 0.0

    for value in values:
        Sum += value
        n += 1
    return float(Sum)/n
– Adam Rosenfield

If you know the length of the generator in advance and you want to avoid storing the full list in memory, you can use:

reduce(np.add, generator)/length
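
A self-contained version of this sketch (my example; note that on Python 3 reduce lives in functools, and np is assumed to be numpy):

from functools import reduce
import numpy as np

generator = (k for k in range(1, 7))
length = 6  # must be known in advance
print(reduce(np.add, generator) / length)  # 3.5
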
– Quant Metropolis
def my_mean(values):
    n = 0
    sum = 0
    for v in values:
        sum += v
        n += 1
    return sum/n

The above is very similar to your code, except that by using for to iterate over values, you are fine whether you get a list or an iterator. The built-in sum function is, however, heavily optimized, so unless the list is really, really long, you might be happier temporarily storing the data, as sketched below.

(Also notice that, since you are using Python 3, you don't need float(sum)/n.)
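
That stored-data variant would be something like (my sketch):

data = list(values)           # materialize the generator once
mean = sum(data) / len(data)  # C-level sum; true division on Python 3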

– Thomas Ahle

Try:

import itertools

def mean(i):
    (i1, i2) = itertools.tee(i, 2)
    return sum(i1) / sum(1 for _ in i2)

print(mean([1, 2, 3, 4, 5]))

tee will duplicate your iterator for any iterable i (e.g. a generator, a list, etc.), allowing you to use one duplicate for summing and the other for counting.

(Note that 'tee' will still use intermediate storage).

– payne
  • This temporarily stores the whole list. Memory-wise, it's equivalent to converting to a list first and then using `sum(a)/len(a)`, but using a list would be faster. – Sven Marnach Feb 10 '11 at 23:19
  • Good point, true -- I was just looking at how tee() is implemented. I hate it when that happens. :-) – payne Feb 10 '11 at 23:21
  • You would think that `tee` could be implemented by only storing the "diff" between the cloned iterators, i.e. the elements that one has consumed but the other has not yet. – Ryan C. Thompson Feb 24 '12 at 22:08