
I'm doing some statistics work. I have a (large) collection of random numbers to compute the mean of, and I'd like to work with generators: since I only need the mean, there's no reason to store all the numbers.

The problem is that numpy.mean breaks if you pass it a generator. I can write a simple function to do what I want, but I'm wondering if there's a proper, built-in way to do this?

It would be nice if I could say "sum(values)/len(values)", but len doesn't work for generators, and sum has already consumed values by then.

Here's an example:

import numpy 

def my_mean(values):
    n = 0
    Sum = 0.0
    try:
        while True:
            Sum += next(values)
            n += 1
    except StopIteration: pass
    return float(Sum)/n

X = [k for k in range(1,7)]
Y = (k for k in range(1,7))

print(numpy.mean(X))
print(my_mean(Y))

These both give the same, correct answer, but my_mean doesn't work for lists, and numpy.mean doesn't work for generators.

I really like the idea of working with generators, but details like this seem to spoil things.

– nick maxwell
  • You'd know how many random numbers your generator would produce, wouldn't you? – Sven Marnach Feb 10 '11 at 23:15
  • @Sven Marnach: suppose the generator is reading from a file? – Jimmy Feb 10 '11 at 23:22
  • If you really want to not store the data (and not implement your own slower `sum` function), you could create a counting generator and call it like this: `co = countingGen(); mean = sum(co(data))/co.getCount()` – Thomas Ahle Feb 10 '11 at 23:27
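
A minimal sketch of the counting-generator idea from the last comment; `countingGen` and `getCount` are hypothetical names taken from that comment, not from any library:

class countingGen:
    """Wrap an iterable and count items as they are consumed."""
    def __init__(self):
        self.count = 0

    def __call__(self, iterable):
        for item in iterable:
            self.count += 1
            yield item

    def getCount(self):
        return self.count

data = (k for k in range(1, 7))
co = countingGen()
print(sum(co(data)) / co.getCount())  # 3.5, without storing the numbers

Since Python evaluates the division left to right, sum() has fully consumed the generator before getCount() is read.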

10 Answers


In general, if you're doing a streaming mean calculation of floating point numbers, you're probably better off using a more numerically stable algorithm than simply summing the generator and dividing by the length.

The simplest of these (that I know) is usually credited to Knuth and is also known as Welford's online algorithm; it calculates the variance as well. Just the mean portion is copied here for completeness.

def mean(data):
    n = 0
    mean = 0.0

    for x in data:
        n += 1
        # running update: new_mean = old_mean + (x - old_mean)/n
        mean += (x - mean)/n

    if n < 1:
        return float('nan')
    else:
        return mean
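
For reference, here is the same recurrence extended with the variance part mentioned above; this is my own sketch of Welford's method, not code from the original source:

def mean_and_variance(data):
    # Single pass, numerically stable (Welford's method).
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in data:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # the second factor uses the *updated* mean
    if n < 1:
        return float('nan'), float('nan')
    if n < 2:
        return mean, float('nan')
    return mean, m2 / (n - 1)  # sample variance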

I know this question is super old, but it's still the first hit on Google, so it seemed appropriate to post. I'm still sad that the Python standard library doesn't contain this simple piece of code.

– Erik

Just one simple change to your code would let you use both. Generators are meant to be used interchangeably with lists in a for loop.

def my_mean(values):
    n = 0
    Sum = 0.0
    for v in values:
        Sum += v
        n += 1
    return Sum / n
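
For example, with the X and Y from the question, both calls now give the same answer:

X = [k for k in range(1, 7)]
Y = (k for k in range(1, 7))
print(my_mean(X))  # 3.5
print(my_mean(Y))  # 3.5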
– Mark Ransom
def my_mean(values):
    total = 0
    for n, v in enumerate(values, 1):
        total += v
    return total / n

print(my_mean(X))
print(my_mean(Y))

There is statistics.mean() in Python 3.4, but it calls list() on iterator input:

def mean(data):
    if iter(data) is data:
        data = list(data)
    n = len(data)
    if n < 1:
        raise StatisticsError('mean requires at least one data point')
    return _sum(data)/n

where _sum() returns an accurate sum (a math.fsum()-like function that, in addition to float, also supports Fraction and Decimal).
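
So on Python 3.4+ you can simply pass a generator in, at the cost of it being materialized internally; a quick illustration (my example, not part of the original answer):

import statistics

Y = (k for k in range(1, 7))
print(statistics.mean(Y))  # 3.5, but the generator was copied into a list first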

– jfs

The old-fashioned way to do it:

def my_mean(values):
    sum, n = 0, 0
    for x in values:
        sum += x
        n += 1
    return float(sum)/n
– Jimmy

You can use reduce without knowing the size of the array:

from functools import reduce  # Python 3: reduce is no longer a builtin
from itertools import count   # izip is just zip in Python 3

reduce(lambda c, i: (c*(i[1]-1) + float(i[0]))/i[1], zip(values, count(1)), 0)
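
Each step turns the running mean of the first i-1 items into the running mean of the first i items. For example, reusing the imports above (my usage sketch):

values = (k for k in range(1, 7))
mean = reduce(lambda c, i: (c*(i[1]-1) + float(i[0]))/i[1],
              zip(values, count(1)), 0)
print(mean)  # 3.5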
– topkara

One way would be

numpy.fromiter(Y, int).mean()

but this actually temporarily stores the numbers.
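
As an aside (my addition, not part of the original answer): if the length happens to be known up front, numpy.fromiter also accepts a count argument so the array can be preallocated, though the values are still stored in full:

import numpy

Y = (k for k in range(1, 7))
print(numpy.fromiter(Y, dtype=float, count=6).mean())  # 3.5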

– Sven Marnach

Your approach is a good one, but you should use the for x in y idiom instead of repeatedly calling next until you get a StopIteration. This works for both lists and generators:

def my_mean(values):
    n = 0
    Sum = 0.0

    for value in values:
        Sum += value
        n += 1
    return float(Sum)/n
– Adam Rosenfield

If you know the length of the generator in advance and you want to avoid storing the full list in memory, you can use:

reduce(np.add, generator)/length
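
A self-contained version of this sketch (my example; note that on Python 3 reduce lives in functools, and np is assumed to be numpy):

from functools import reduce
import numpy as np

generator = (k for k in range(1, 7))
length = 6  # must be known in advance
print(reduce(np.add, generator) / length)  # 3.5
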
– Quant Metropolis
def my_mean(values):
    n = 0
    sum = 0
    for v in values:
        sum += v
        n += 1
    return sum/n

The above is very similar to your code, except that by using for to iterate over values, you are fine whether you get a list or an iterator. The built-in sum function is, however, heavily optimized, so unless the list is really, really long, you might be happier temporarily storing the data, as sketched below.

(Also notice that, since you are using Python 3, you don't need float(sum)/n.)
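
That stored-data variant would be something like (my sketch):

data = list(values)           # materialize the generator once
mean = sum(data) / len(data)  # C-level sum; true division on Python 3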

– Thomas Ahle

Try:

import itertools

def mean(i):
    (i1, i2) = itertools.tee(i, 2)
    return sum(i1) / sum(1 for _ in i2)

print(mean([1, 2, 3, 4, 5]))

tee will duplicate your iterator for any iterable i (e.g. a generator, a list, etc.), allowing you to use one duplicate for summing and the other for counting.

(Note that 'tee' will still use intermediate storage).

– payne
  • This temporarily stores the whole list. Memory-wise, it's equivalent to converting to a list first and then using `sum(a)/len(a)`, but using a list would be faster. – Sven Marnach Feb 10 '11 at 23:19
  • Good point, true -- I was just looking at how tee() is implemented. I hate it when that happens. :-) – payne Feb 10 '11 at 23:21
  • You would think that `tee` could be implemented by only storing the "diff" between the cloned iterators, i.e. the elements that one has consumed but the other has not yet. – Ryan C. Thompson Feb 24 '12 at 22:08