
I am processing many lines, and I want to group them based on whether the value x in the current line is within 100 of the value x in the previous line.

For example, given the input

5, "hello"
10, "was"
60, "bla"
5000, "qwerty"

"hello", "was" and "bla" should be one group, "qwerty" another.

Is there a way to solve this neatly with groupby? All the solutions I can think of feel a bit hackish, like giving the key function a dict default argument that holds the previous value and updating it each time groupby calls the function.
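For concreteness, the mutable-default version I mean looks something like this sketch (cluster_function and the state dict are just placeholder names):

import itertools
import sys

def cluster_function(line, state={'prev': None, 'key': 0}):
    # hackish: the mutable default dict persists between calls
    x = int(line.split(',')[0])
    if state['prev'] is not None and abs(x - state['prev']) > 100:
        state['key'] += 1
    state['prev'] = x
    return state['key']

for key, group in itertools.groupby(sys.stdin, cluster_function):
    print(key, [line.strip() for line in group])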

The Unfun Cat
  • PS. This is a trivial problem to solve; I just want to do it as prettily and readably as possible, avoiding all the distracting "ifs" and temp variables necessary in the naive solution: for line in groupby(stdin, cluster_function): ... – The Unfun Cat Apr 26 '15 at 07:25
  • Do you mean `itertools.groupby`? [This question](http://stackoverflow.com/questions/29307784/groupby-based-on-value-in-previous-row/29308056) is about how to do this in pandas. – BrenBarn Apr 26 '15 at 07:32
  • Yeah, itertools.groupby. That's my question, but pandas does not work quickly or robustly on streams. (Chunking the data might destroy groups that span chunks.) – The Unfun Cat Apr 26 '15 at 07:35
  • There's no way to do it without some sort of temp variable; since `groupby` doesn't store the previous value, you have to store it yourself somehow. – BrenBarn Apr 26 '15 at 08:05

2 Answers


You could just write a simple class to encapsulate the temp variables, then use a method of that class as the key function:

class KeyClass(object):
    def __init__(self):
        self.lastValue = None  # x value from the previous item
        self.currentKey = 1    # current group number

    def handleVal(self, val):
        # start a new group when the jump from the previous value exceeds 100
        if self.lastValue is not None and abs(val - self.lastValue) > 100:
            self.currentKey += 1
        self.lastValue = val
        return self.currentKey

>>> import itertools
>>> data = [1, 2, 100, 105, 300, 350, 375, 500, 800, 808]
>>> [(k, list(g)) for k, g in itertools.groupby(data, KeyClass().handleVal)]
[(1, [1, 2, 100, 105]), (2, [300, 350, 375]), (3, [500]), (4, [800, 808])]
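To apply this to comma-separated lines like those in the question, something along these lines should work (the file name data.txt is just for illustration):

>>> kc = KeyClass()
>>> with open('data.txt') as f:
...     rows = ((int(line.split(',')[0]), line.strip()) for line in f)
...     for k, g in itertools.groupby(rows, lambda row: kc.handleVal(row[0])):
...         print(k, [s for _, s in g])
1 ['5, "hello"', '10, "was"', '60, "bla"']
2 ['5000, "qwerty"']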

Just for fun, I also came up with this rather mind-bending way to do it by using the send method of a pre-advanced generator as the key function:

def keyGen():
    curKey = 1
    newVal = yield None
    while True:
        oldVal, newVal = newVal, (yield curKey)
        if oldVal is None or abs(newVal-oldVal) > 100:
            curKey += 1

key = keyGen()
next(key)  # advance to the first yield so .send can be used as the key function

>>> [(k, list(g)) for k, g in itertools.groupby(data, key.send)]
[(1, [1, 2, 100, 105]), (2, [300, 350, 375]), (3, [500]), (4, [800, 808])]

Wrapping your head around that may be a good exercise in understanding .send (it was for me!).
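If it helps, here is a short trace of the send calls, assuming the same data as above:

>>> key = keyGen()
>>> next(key)      # run to the first yield, which yields None
>>> key.send(1)    # newVal = 1; the loop yields curKey
1
>>> key.send(2)    # oldVal = 1, newVal = 2; |2 - 1| <= 100, same group
1
>>> key.send(300)  # |300 - 2| > 100, so curKey is bumped
2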

BrenBarn

There might be some clever trick with itertools.groupby, but it is simple enough to write a custom generator function for your particular problem. Maybe something like this (untested):

def grouper(it):
    group = []
    for item in it:
        # extend the current group while the new x is within 100 of the previous one
        if not group or abs(int(item[0]) - int(group[-1][0])) <= 100:
            group.append(item)
        else:
            yield group
            group = [item]
    if group:  # yield final group if not empty
        yield group

Usage would be something like

with open(filename) as fid:
    for group in grouper(line.split(',') for line in fid):
        # do something with group, e.g. iterate over its items
        for item in group:
            print(item)  # do something with item
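To try it without a file, you could feed the sample data from the question through io.StringIO, something like:

import io

sample = io.StringIO('5, "hello"\n10, "was"\n60, "bla"\n5000, "qwerty"\n')
for group in grouper(line.split(',') for line in sample):
    print([name.strip() for _, name in group])
# prints ['"hello"', '"was"', '"bla"'], then ['"qwerty"']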
Bas Swinckels
  • This might be the best and most pythonic way to do it (and deserves more upvotes), but the other answer answered my question, even though it might be a dumber way to do it. – The Unfun Cat Apr 27 '15 at 07:00