
I am processing many lines, and I want to group them based on whether the value x in the current line is within 100 of the value x in the previous line.

For example, given the input

5, "hello"
10, "was"
60, "bla"
5000, "qwerty"

"hello", "was" and "bla" should be one group, "qwerty" another.

Is there a way to solve this neatly with groupby? All the solutions I can think of feel a bit hackish, like giving the key function a dict default argument that holds the previous value and updating it each time groupby calls the function.
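For concreteness, the mutable-default version I mean looks something like this sketch (cluster_function and the state dict are just placeholder names):

import itertools
import sys

def cluster_function(line, state={'prev': None, 'key': 0}):
    # hackish: the mutable default dict persists between calls
    x = int(line.split(',')[0])
    if state['prev'] is not None and abs(x - state['prev']) > 100:
        state['key'] += 1
    state['prev'] = x
    return state['key']

for key, group in itertools.groupby(sys.stdin, cluster_function):
    print(key, [line.strip() for line in group])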

The Unfun Cat
  • PS. This is a trivial problem to solve; I just want to do it as prettily and readably as possible, avoiding all the distracting "ifs" and temp variables necessary in the naive solution: for line in groupby(stdin, cluster_function): ... – The Unfun Cat Apr 26 '15 at 07:25
  • Do you mean `itertools.groupby`? [This question](http://stackoverflow.com/questions/29307784/groupby-based-on-value-in-previous-row/29308056) is about how to do this in pandas. – BrenBarn Apr 26 '15 at 07:32
  • Yeah, itertools.groupby. That's my question, but pandas does not work quickly or robustly on streams. (Chunking the data might destroy groups that span chunks.) – The Unfun Cat Apr 26 '15 at 07:35
  • There's no way to do it without some sort of temp variable; since `groupby` doesn't store the previous value, you have to store it yourself somehow. – BrenBarn Apr 26 '15 at 08:05

2 Answers


You could just write a simple class to encapsulate the temp variables, then use a method of that class as the key function:

class KeyClass(object):
    def __init__(self):
        self.lastValue = None  # x value from the previous item
        self.currentKey = 1    # current group number

    def handleVal(self, val):
        # start a new group when the jump from the previous value exceeds 100
        if self.lastValue is not None and abs(val - self.lastValue) > 100:
            self.currentKey += 1
        self.lastValue = val
        return self.currentKey

>>> import itertools
>>> data = [1, 2, 100, 105, 300, 350, 375, 500, 800, 808]
>>> [(k, list(g)) for k, g in itertools.groupby(data, KeyClass().handleVal)]
[(1, [1, 2, 100, 105]), (2, [300, 350, 375]), (3, [500]), (4, [800, 808])]
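To apply this to comma-separated lines like those in the question, something along these lines should work (the file name data.txt is just for illustration):

>>> kc = KeyClass()
>>> with open('data.txt') as f:
...     rows = ((int(line.split(',')[0]), line.strip()) for line in f)
...     for k, g in itertools.groupby(rows, lambda row: kc.handleVal(row[0])):
...         print(k, [s for _, s in g])
1 ['5, "hello"', '10, "was"', '60, "bla"']
2 ['5000, "qwerty"']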

Just for fun, I also came up with this rather mind-bending way to do it by using the send method of a pre-advanced generator as the key function:

def keyGen():
    curKey = 1
    newVal = yield None
    while True:
        oldVal, newVal = newVal, (yield curKey)
        if oldVal is None or abs(newVal-oldVal) > 100:
            curKey += 1

key = keyGen()
next(key)  # advance to the first yield so .send can be used as the key function

>>> [(k, list(g)) for k, g in itertools.groupby(data, key.send)]
[(1, [1, 2, 100, 105]), (2, [300, 350, 375]), (3, [500]), (4, [800, 808])]

Wrapping your head around that may be a good exercise in understanding .send (it was for me!).
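If it helps, here is a short trace of the send calls, assuming the same data as above:

>>> key = keyGen()
>>> next(key)      # run to the first yield, which yields None
>>> key.send(1)    # newVal = 1; the loop yields curKey
1
>>> key.send(2)    # oldVal = 1, newVal = 2; |2 - 1| <= 100, same group
1
>>> key.send(300)  # |300 - 2| > 100, so curKey is bumped
2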

BrenBarn

There might be some clever trick with itertools.groupby, but it is simple enough to write a custom generator function for your particular problem. Maybe something like this (untested):

def grouper(it):
    group = []
    for item in it:
        # extend the current group while the new x is within 100 of the previous one
        if not group or abs(int(item[0]) - int(group[-1][0])) <= 100:
            group.append(item)
        else:
            yield group
            group = [item]
    if group:  # yield final group if not empty
        yield group

Usage would be something like

with open(filename) as fid:
    for group in grouper(line.split(',') for line in fid):
        # do something with group, e.g. iterate over its items
        for item in group:
            print(item)  # do something with item
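To try it without a file, you could feed the sample data from the question through io.StringIO, something like:

import io

sample = io.StringIO('5, "hello"\n10, "was"\n60, "bla"\n5000, "qwerty"\n')
for group in grouper(line.split(',') for line in sample):
    print([name.strip() for _, name in group])
# prints ['"hello"', '"was"', '"bla"'], then ['"qwerty"']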
Bas Swinckels
  • This might be the best and most pythonic way to do it (and deserves more upvotes), but the other answer answered my question, even though it might be a dumber way to do it. – The Unfun Cat Apr 27 '15 at 07:00