
A generator can only be consumed once, so its items are available on a single pass only. An alternative is to load the generator into a list and make multiple passes, but that costs both performance and memory.

Can anyone think of a better way to compute the following metrics from a generator in a single pass? Ideally the code computes the count, sum, average, standard deviation (sd), max, min and any other stats you can think of.

UPDATE

The initial, horrid code is in this gist: https://gist.github.com/3038746

Using the great suggestions from @larsmans here is the final solution I went with. Using the named tuple really helped.

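from __future__ import division, print_function  # true division and print() on Python 2

import random
from math import sqrt
from collections import namedtuple

# Defined once at module level so the class is not rebuilt on every call.
Stat = namedtuple('Stat', 'total, sum, avg, sd, max, min')

def stat(gen):
    """Return a Stat namedtuple for a non-empty iterable, in a single pass."""
    it = iter(gen)

    x0 = next(it)  # raises StopIteration if gen is empty
    mx = mn = s = x0
    s2 = x0*x0
    n = 1

    for x in it:
        mx = max(mx, x)
        mn = min(mn, x)
        s += x
        s2 += x*x
        n += 1

    avg = s / n
    # Population variance E[x^2] - E[x]^2, clamped at 0 to guard against rounding.
    var = max(s2/n - avg*avg, 0.0)
    return Stat(n, s, avg, sqrt(var), mx, mn)

def random_int_list(size=100, start=0, end=1000):
    """Despite the name, return a generator of random ints in [start, end)."""
    return (random.randrange(start, end) for _ in range(size))

if __name__ == '__main__':
    r = stat(random_int_list())
    # Output varies per run. The original run, made under Python 2 integer
    # division (hence the truncated avg), printed:
    # Stat(total=100, sum=56295, avg=562, sd=294.82537204250247, max=994, min=10)
    print(r)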
Matt Alcock

1 Answer

def statistics(it):
    """Returns number of elements, sum, max, min"""

    it = iter(it)

    x0 = next(it)
    maximum = minimum = total = x0
    n = 1

    for x in it:
        maximum = max(maximum, x)
        minimum = min(minimum, x)
        total += x
        n += 1

    return n, total, maximum, minimum

Add other statistics as you please. Consider using a namedtuple when the number of statistics to compute grows large.
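As one way of adding the mean and standard deviation, here is a minimal sketch that swaps in Welford's running update rather than accumulating a sum of squares, since the latter can lose precision when the values are large relative to their spread. The names `RunningStats` and `running_stats` are made up for this illustration, and the sd is the population one (dividing by n), which is an assumption about what is wanted.

```python
from collections import namedtuple
from math import sqrt

# Hypothetical result type for this sketch.
RunningStats = namedtuple('RunningStats', 'count mean sd minimum maximum')

def running_stats(iterable):
    """Single-pass count, mean, population sd, min and max (Welford's update)."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    minimum = maximum = None

    for x in iterable:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # uses the already-updated mean
        if n == 1:
            minimum = maximum = x
        else:
            minimum = min(minimum, x)
            maximum = max(maximum, x)

    if n == 0:
        raise ValueError('running_stats() needs at least one value')
    return RunningStats(n, mean, sqrt(m2 / n), minimum, maximum)
```

Called as `running_stats(x for x in [2, 4, 4, 4, 5, 5, 7, 9])`, this should give a mean of 5.0 and an sd of 2.0, up to floating-point rounding. For a sample standard deviation, divide `m2` by `n - 1` instead.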

If you want to get really fancy, you can build an OO hierarchy of statistics collectors (untested):

class Summer(object):
    def __init__(self, x0=0):
        self.value = x0

    def add(self, x):
        self.value += x

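class SquareSummer(Summer):
    def __init__(self, x0=0):
        # Seed with the square of the first element, not the element itself.
        super(SquareSummer, self).__init__(x0 ** 2)

    def add(self, x):
        super(SquareSummer, self).add(x ** 2)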

class Maxer(object):
    def __init__(self, x0):
        self.value = x0

    def add(self, x):
        self.value = max(self.value, x)

# example usage: collect([Maxer, Summer], iterable)
def collect(collectors, it):
    it = iter(it)

    x0 = next(it)
    collectors = [c(x0) for c in collectors]

    for x in it:
        for c in collectors:
            c.add(x)

    return [c.value for c in collectors]
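To show how the collector idea might be exercised end to end, here is a self-contained sketch. It repeats compact versions of the classes so that it runs on its own, seeds `SquareSummer` with the square of the first element, and adds a `Counter` collector, which is a hypothetical name introduced only for this example.

```python
class Summer(object):
    def __init__(self, x0=0):
        self.value = x0

    def add(self, x):
        self.value += x

class SquareSummer(Summer):
    def __init__(self, x0=0):
        # The first element must be squared too, or the total is off.
        super(SquareSummer, self).__init__(x0 ** 2)

    def add(self, x):
        super(SquareSummer, self).add(x ** 2)

class Maxer(object):
    def __init__(self, x0):
        self.value = x0

    def add(self, x):
        self.value = max(self.value, x)

class Counter(object):
    """Hypothetical collector: counts elements and ignores their values."""
    def __init__(self, x0):
        self.value = 1  # the first element has already been seen

    def add(self, x):
        self.value += 1

def collect(collectors, it):
    it = iter(it)
    x0 = next(it)  # raises StopIteration on an empty iterable
    collectors = [c(x0) for c in collectors]
    for x in it:
        for c in collectors:
            c.add(x)
    return [c.value for c in collectors]

# One pass over a generator, four statistics out.
n, total, squares, biggest = collect(
    [Counter, Summer, SquareSummer, Maxer],
    (x for x in [3, 1, 4, 1, 5]),
)
print(n, total, squares, biggest)  # 5 14 52 5
```

The count, sum and sum of squares are enough to derive the mean and the variance afterwards, so each collector stays small.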
Fred Foo
  • This is great, @larsmans, but why the need for `it = iter(it)`? Is this defensive type checking? – Matt Alcock Jul 03 '12 at 09:48
  • @MattAlcock: it's needed to get the function to work on iterables as well as iterators. E.g., `next([1])` raises a `TypeError`. – Fred Foo Jul 03 '12 at 09:51
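The point about `iter()` can be checked with a short sketch. The helper name `first_item` is hypothetical and exists only to illustrate the pattern.

```python
def first_item(iterable):
    """Return the first element of any iterable (hypothetical helper)."""
    it = iter(iterable)  # lists, tuples, generators: all become iterators
    return next(it)

# A list is iterable but is not itself an iterator, so next() rejects it.
try:
    next([1, 2, 3])
    list_is_iterator = True
except TypeError:
    list_is_iterator = False

# Calling iter() on something that is already an iterator returns it unchanged,
# so the extra call costs nothing when a generator is passed in.
gen = (x for x in (7, 8))
same_object = iter(gen) is gen
```

Here `list_is_iterator` ends up `False` and `same_object` ends up `True`, so the call is about normalising the input, not about type checking.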