
I'm parsing a file like this:

--header--
data1
data2
--header--
data3
data4
data5
--header--
--header--
...

And I want groups like this:

[ [header, data1, data2], [header, data3, data4, data5], [header], [header], ... ]

so I can iterate over them like this:

for grp in group(open('file.txt'), lambda line: 'header' in line):
    for item in grp:
        process(item)

and keep the detect-a-group logic separate from the process-a-group logic.

But I need an iterable of iterables, as the groups can be arbitrarily large and I don't want to store them. That is, I want to split an iterable into subgroups every time I encounter a "sentinel" or "header" item, as indicated by a predicate. Seems like this would be a common task, but I can't find an efficient Pythonic implementation.

Here's the dumb append-to-a-list implementation:

def group(iterable, isstart=lambda x: x):
    """Group `iterable` into groups starting with items where `isstart(item)` is true.

    Start items are included in the group.  The first group may or may not have a 
    start item.  An empty `iterable` results in an empty result (zero groups)."""
    items = []
    for item in iterable:
        if isstart(item) and items:
            yield iter(items)
            items = []
        items.append(item)
    if items:
        yield iter(items) 

It feels like there's got to be a nice itertools version, but it eludes me. The 'obvious' (?!) groupby solution doesn't seem to work because there can be adjacent headers, and they need to go in separate groups. The best I can come up with is (ab)using groupby with a key function that keeps a counter:

def igroup(iterable, isstart=lambda x: x):
    def keyfunc(item):
        if isstart(item):
            keyfunc.groupnum += 1       # Python 2's closures leave something to be desired
        return keyfunc.groupnum
    keyfunc.groupnum = 0
    return (group for _, group in itertools.groupby(iterable, keyfunc))

But I feel like Python can do better -- and sadly, this is even slower than the dumb list version:

# ipython
%time deque(group(xrange(10 ** 7), lambda x: x % 1000 == 0), maxlen=0)
CPU times: user 4.20 s, sys: 0.03 s, total: 4.23 s

%time deque(igroup(xrange(10 ** 7), lambda x: x % 1000 == 0), maxlen=0)
CPU times: user 5.45 s, sys: 0.01 s, total: 5.46 s
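(For reference: in Python 3 the function-attribute hack in `keyfunc` isn't needed, since `nonlocal` can rebind the counter directly. A sketch of the same counter idea:)

```python
import itertools

def igroup3(iterable, isstart=lambda x: x):
    """Python 3 variant of igroup: nonlocal replaces the function attribute."""
    groupnum = 0
    def keyfunc(item):
        nonlocal groupnum
        if isstart(item):
            groupnum += 1  # every start item bumps the key, opening a new group
        return groupnum
    return (group for _, group in itertools.groupby(iterable, keyfunc))
```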

To make it easy on you, here's some unit test code:

class Test(unittest.TestCase):
    def test_group(self):
        MAXINT, MAXLEN, NUMTRIALS = 100, 100000, 21
        isstart = lambda x: x == 0
        self.assertEqual(next(igroup([], isstart), None), None)
        self.assertEqual([list(grp) for grp in igroup([0] * 3, isstart)], [[0]] * 3)
        self.assertEqual([list(grp) for grp in igroup([1] * 3, isstart)], [[1] * 3])
        self.assertEqual(len(list(igroup([0,1,2] * 3, isstart))), 3)        # Catch hangs when groups are not consumed
        for _ in xrange(NUMTRIALS):
            expected, items = itertools.tee(itertools.starmap(random.randint, itertools.repeat((0, MAXINT), random.randint(0, MAXLEN))))
            for grpnum, grp in enumerate(igroup(items, isstart)):
                start = next(grp)
                self.assertTrue(isstart(start) or grpnum == 0)
                self.assertEqual(start, next(expected))
                for item in grp:
                    self.assertFalse(isstart(item))
                    self.assertEqual(item, next(expected))

So: how can I subgroup an iterable by a predicate elegantly and efficiently in Python?

Doctor J
  • Your "append to list" version isn't consistent with what you say you want. It yields each item in the source iterable as a one-item list. Can you clarify what you're trying to do? Why not give an example of how you propose to use the result (i.e., are you going to iterate over it with nested for loops or what)? – BrenBarn Oct 08 '12 at 04:53
  • @BrenBarn: The generator converts `[1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0]` to `[[1, 0, 0, 0], [1, 0, 0], [1, 0], [1, 0]]`. – Blender Oct 08 '12 at 04:57
  • Ah, I see, I didn't notice what the default `isstart` was doing. But it would still be good to have an example of how you hope to use this. – BrenBarn Oct 08 '12 at 04:57
  • @BrenBarn: The second parameter returns `True` when an element denotes a section, so for that particular example, I used `igroup(l, lambda x: x == 1))`. I'd imagine the list version behaves identically. – Blender Oct 08 '12 at 04:58
  • You're right, I wasn't very clear; I added sample usage, and also made the example harder. :) – Doctor J Oct 08 '12 at 05:13
  • @DoctorJ: You continue to add additional requirements in comments on answers. You should edit your question to state completely what you expect the API for this object to be. It's not going to be possible to avoid storing data *and* also get independent sub-generators for the groups that you can use in any order. If you want to not store any temporary data, you must iterate over each group in order. – BrenBarn Oct 08 '12 at 18:34
  • @BrenBarn: Sorry about that; I didn't mean to frustrate anyone. I updated the unit test with my requirements, so if it passes, I'm happy. – Doctor J Oct 08 '12 at 20:57
  • @DoctorJ: so far only [my answer](http://stackoverflow.com/a/12789059/4279) passes all tests. But all solutions are slower than your `list`-based solution from the question on the `xrange(10**7)` benchmark on my machine. btw, you might be optimizing the wrong thing unless your profiler says that `igroup()` is the bottleneck and not IO or `process(item)`. – jfs Oct 08 '12 at 21:26
  • Yeah, I'm kind of just optimizing for fun now. :) Thanks for your help. – Doctor J Oct 09 '12 at 01:25

4 Answers


how can I subgroup an iterable by a predicate elegantly and efficiently in Python?

Here's a concise, memory-efficient implementation which is very similar to the one from your question:

from itertools import groupby, imap
from operator import itemgetter

def igroup(iterable, isstart):
    def key(item, count=[False]):
        if isstart(item):
            count[0] = not count[0] # start new group
        return count[0]
    return imap(itemgetter(1), groupby(iterable, key))

It supports infinite groups.
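(A quick Python 3 rendering of the same function — `map` for `imap` — shows why adjacent headers land in separate groups: each header flips the boolean key, so `groupby` sees a new run:)

```python
from itertools import groupby
from operator import itemgetter

def igroup(iterable, isstart):
    def key(item, count=[False]):
        if isstart(item):
            count[0] = not count[0]  # each header toggles the key, starting a new run
        return count[0]
    return map(itemgetter(1), groupby(iterable, key))

print([list(g) for g in igroup([0, 0, 1, 0, 1, 1], lambda x: x == 0)])
# each 0 starts a new group: [[0], [0, 1], [0, 1, 1]]
```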

A tee-based solution is slightly faster, but it consumes memory for the current group (similar to the list-based solution from the question):

from itertools import islice, tee

def group(iterable, isstart):
    it, it2 = tee(iterable)
    count = 0
    for item in it:
        if isstart(item) and count:
            gr = islice(it2, count)
            yield gr
            for _ in gr:  # skip to the next group
                pass
            count = 0
        count += 1
    if count:
        gr = islice(it2, count)
        yield gr
        for _ in gr:  # skip to the next group
            pass

The groupby solution could also be implemented in pure Python:

def igroup_inline_key(iterable, isstart):
    it = iter(iterable)

    def grouper():
        """Yield items from a single group."""
        while not p[START]:
            yield p[VALUE]  # each group has at least one element (a header)
            p[VALUE] = next(it)
            p[START] = isstart(p[VALUE])

    p = [None]*2 # workaround the absence of `nonlocal` keyword in Python 2.x
    START, VALUE = 0, 1
    p[VALUE] = next(it)
    while True:
        p[START] = False # to distinguish EOF and a start of new group
        yield grouper()
        while not p[START]: # skip to the next group
            p[VALUE] = next(it)
            p[START] = isstart(p[VALUE])

To avoid repeating the code, the while True loop could be written as:

while True:
    p[START] = False  # to distinguish EOF and a start of new group
    g = grouper()
    yield g
    if not p[START]:  # skip to the next group
        for _ in g:
            pass
        if not p[START]:  # EOF
            break

Though the previous variant might be more explicit and readable.

I think a general memory-efficient solution in pure Python won't be significantly faster than the groupby-based one.

If process(item) is fast compared to igroup() and a header can be found efficiently in a string (e.g., for a fixed static header), then you could improve performance by reading your file in large chunks and splitting on the header value. That should make your task IO-bound.
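For illustration, here's a rough Python 3 sketch of that chunked approach, assuming the header is one fixed string (the name `group_chunks` and its signature are made up for this example, not from the answer):

```python
def group_chunks(f, header, chunksize=1 << 16):
    """Yield the text between consecutive occurrences of the fixed string
    `header`, reading in large chunks.  The first piece is whatever precedes
    the first header (possibly empty); the headers themselves are dropped."""
    buf = ''
    while True:
        chunk = f.read(chunksize)
        if not chunk:
            break
        buf += chunk
        # Hold back the tail: a header may straddle a chunk boundary.
        *done, buf = buf.split(header)
        for piece in done:
            yield piece
    yield buf

import io
print(list(group_chunks(io.StringIO("--h--\na\nb\n--h--\nc\n"), "--h--\n", chunksize=4)))
# ['', 'a\nb\n', 'c\n']
```

The point is that str.split scans at C speed, so the Python-level loop runs once per group rather than once per line; each yielded piece can then be split into lines for processing.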

jfs
  • Thanks for investigating. I just wanted to see if there was some clever chain-zip-slice-groupby-whatever incantation that was significantly better, but it sounds like not. For whatever reason, using an attribute and a generator is faster than a parameter and `itemgetter` for me. – Doctor J Oct 08 '12 at 20:54
  • @DoctorJ: your `igroup()` is also faster than the above function with `itemgetter()` for `xrange(10**7)` benchmark on my machine. – jfs Oct 08 '12 at 20:56
  • Note that, like `groupby`, this will consume the source iterable if you try to list the outer iterable. So although you can do `list()` on it to get a list whose length is equal to the number of groups, you can't actually use the resulting sub-iterables in that list, because the groupby already exhausted them in order to group them. I guess it depends what you want to do. – BrenBarn Oct 09 '12 at 18:43
  • @BrenBarn: it is correct. [`groupby` docs mention it](http://docs.python.org/py3k/library/itertools#itertools.groupby): «Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible. So, if that data is needed later, it should be stored as a list». Solutions that don't use `groupby` directly have "skip to the next group" comment. – jfs Oct 09 '12 at 18:55

I didn't quite read all your code, but I think this might be of some help:

from itertools import izip, tee, chain


def pairwise(iterable):
    a, b = tee(iterable)
    return izip(a, chain(b, [next(b, None)]))


def group(iterable, isstart):

    pairs = pairwise(iterable)

    def extract(current, lookahead, pairs=pairs, isstart=isstart):
        yield current
        if isstart(lookahead):
            return
        for current, lookahead in pairs:
            yield current
            if isstart(lookahead):
                return

    for start, lookahead in pairs:
        gen = extract(start, lookahead)
        yield gen
        for _ in gen:
            pass


for gen in group(xrange(4, 16), lambda x: x % 5 == 0):
    print '------------------'
    for n in gen:
        print n

print [list(g) for g in group([], lambda x: x % 5 == 0)]

Result:

$ python gen.py
------------------
4
------------------
5
6
7
8
9
------------------
10
11
12
13
14
------------------
15
[]

Edit:

And here's another solution, similar to the one above, but without pairwise(), using a sentinel instead. I don't know which one is faster:

def group(iterable, isstart):

    sentinel = object()

    def interleave(iterable=iterable, isstart=isstart, sentinel=sentinel):
        for item in iterable:
            if isstart(item):
                yield sentinel
            yield item

    items = interleave()

    def extract(item, items=items, isstart=isstart, sentinel=sentinel):
        if item is not sentinel:
            yield item
        for item in items:
            if item is sentinel:
                return
            yield item

    for lookahead in items:
        gen = extract(lookahead)
        yield gen
        for _ in gen:
            pass

Both now pass the test case, thanks to J.F.Sebastian's idea of exhausting skipped subgroup generators.

pillmuncher
  • 1
    1. `[list(g) for g in group([0,0,0], lambda x: x == 0)]` -> `[[0, 0], [0]]` 2. `CPU times: user 8.16 s, sys: 0.04 s, total: 8.20 s` 3. `$ wc -l gen.py` -> `18` – Doctor J Oct 08 '12 at 05:23
  • @DoctorJ: You're right. I guess, I just felt a bit lispy... I'll fix it now. – pillmuncher Oct 08 '12 at 05:27
  • There was an odd kinda off-by-one error in my code, eg. when the first item was 4. Fixed now. – pillmuncher Oct 08 '12 at 05:52
  • Closer! How about when `iterable` is empty? `[list(g) for g in group([], lambda x: x == 0)]` -> `[[]]`. :( Also, your performance has regressed: `CPU times: user 13.15 s, sys: 0.01 s, total: 13.16 s`. – Doctor J Oct 08 '12 at 06:25
  • @DoctorJ: That's the best I can do for now. I don't think it'll work though, unless you completely consume every item of every group in exactly the same order they're created. When you have two groups and read one item from the first, then one from the second, and then again one from the first, the order will be screwed up. That's because the thing is truly dynamic and nothing (except the lookahead item) is cached, and everything operates on the very same iterator. – pillmuncher Oct 08 '12 at 07:57
  • The loop in `group` could ensure that each group is consumed by `e = extract( ); yield e; for dummy in e: pass` – Janne Karila Oct 08 '12 at 09:46
  • it fails for [`[0,1,2]*3, isstart=lambda x: x == 0` test case](http://ideone.com/Ymo1y) from the question. – jfs Oct 08 '12 at 21:11
  • @J.F.Sebastian: That's a corollary to what I was talking about in my comment above: you have to consume every subgroup generator before you can proceed to the next one. If you don't, every subgroup generator gets created with one element from the underlying iterator, and if you don't start consuming that subgenerator and create another one instead, that one gets started with the next element of the underlying iterator, although that element should belong to the previous subgenerator. That's really a spec'd bug: *"groups can be arbitrarily large and I don't want to store them"*. – pillmuncher Oct 09 '12 at 07:43
  • @J.F.Sebastian: To go a little deeper: if you want to ensure that the next subgroup starts at the next dividing element (as defined by *isstart()*), you would have to either consume the underlying iterator up to that point in the current subgroup generator, or use divination. If you go with consuming, then you must store the consumed elements until they're in turn consumed by the client. But that would violate the spec. – pillmuncher Oct 09 '12 at 10:37
  • @pillmuncher: your general idea is right but it is incorrect that you must store elements to pass the tests in the question. Look at [my answer](http://stackoverflow.com/a/12789059/4279). – jfs Oct 09 '12 at 10:47
  • @J.F.Sebastian: [Your code also fails](http://www.python-forum.de/pastebin.php?mode=view&s=306) under the conditions I mentioned in my comments. – pillmuncher Oct 09 '12 at 12:52
  • @pillmuncher: where in the question is the requirement to consume items out of order? Your code fails on the tests *from the question*. – jfs Oct 09 '12 at 15:56
  • @J.F.Sebastian: You're right. Please take a look at my refined code. – pillmuncher Oct 09 '12 at 23:50

The crucial thing is you have to write a generator that yields sub-generators. My solution is similar in concept to the one by @pillmuncher, but is more self-contained because it avoids using itertools machinery to make ancillary generators. The downside is I have to use a somewhat inelegant temp list. In Python 3 this could perhaps be done more nicely with nonlocal.

def grouper(iterable, isstart):
    it = iter(iterable)
    last = [next(it)]
    def subgroup():
        while True:
            toYield = last[0]
            try:
                last.append(next(it))
            except StopIteration:
                last.pop(0)
                yield toYield
                raise StopIteration
            else:
                yield toYield
                last.pop(0)
            if isstart(last[0]):
                raise StopIteration
    while True:
        sg = subgroup()
        yield sg
        if len(last) == 2:
            # subgenerator was aborted before completion, let's finish it
            for a in sg:
                pass
        if last:
            # sub-generator left next element waiting, next sub-generator will yield it
            pass
        else:
            # sub-generator left "last" empty because source iterable was exhausted
            raise StopIteration

>>> for g in grouper([0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0], lambda x: x==0):
...     print "Group",
...     for i in g:
...         print i,
...     print
Group 0 1 1
Group 0 1
Group 0 1 1 1 1
Group 0

I don't know what this is like performance-wise, I just did it because it was just an interesting thing to try to do.

Edit: I ran your unit test on your original two and mine. It looks like mine is a bit faster than your igroup but still slower than the list-based version. It seems natural that you'll have to make a tradeoff between speed and memory here; if you know the groups won't be too terribly large, use the list-based version for speed. If the groups could be huge, use a generator-based version to keep memory usage down.

Edit: The edited version above handles breaking in a different way. If you break out of the sub-generator but resume the outer generator, it will skip the remainder of the aborted group and begin with the next group:

>>> for g in grouper([0, 1, 2, 88, 3, 0, 1, 88, 2, 3, 4, 0, 1, 2, 3, 88, 4], lambda x: x==0):
...     print "Group",
...     for i in g:
...         print i,
...         if i==88:
...             break
...     print
Group 0 1 2 88
Group 0 1 88
Group 0 1 2 3 88
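(As an aside on the nonlocal remark at the top of this answer: with Python 3 the pending item can live in a plain variable instead of the temp list. A hypothetical sketch of the same structure, not a drop-in replacement:)

```python
_done = object()  # sentinel: the source iterator is exhausted

def grouper3(iterable, isstart):
    it = iter(iterable)
    pending = next(it, _done)  # one item of lookahead

    def subgroup():
        nonlocal pending
        while True:
            item, pending = pending, next(it, _done)
            yield item
            if pending is _done or isstart(pending):
                return

    while pending is not _done:
        sg = subgroup()
        yield sg
        for _ in sg:  # drain whatever the caller didn't consume
            pass
```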
BrenBarn
  • Almost! Unfortunately, this hangs if you don't fully consume every group. I didn't mention this as a requirement, but that's a good test case. :) Also, the unit test is randomized and not a good performance test. 1. `%time deque(chain.from_iterable(grouper(xrange(10**7), lambda x: x % 1000 == 0)), maxlen=0)` -> `CPU times: user 10.11 s, sys: 0.05 s, total: 10.16 s` 2. `$ wc -l grouper.py` -> `22`. – Doctor J Oct 08 '12 at 05:56
  • @DoctorJ: What do you mean by "hangs"? – BrenBarn Oct 08 '12 at 06:35
  • @DoctorJ: Did you mean that if you break in the middle of a sub-generator, it also ends the outer one? I wouldn't really call that "hanging", but I see that my solution does do that. I edited my answer with a version that works differently: aborting a sub-generator but continuing with the outer generator now resumes at the beginning of the next group. – BrenBarn Oct 08 '12 at 07:02
  • I mean this: `list(grouper([0,1,2], lambda x: x == 0))` loops forever, exhausts my memory and DOSes my machine. :( Try the updated unit test. – Doctor J Oct 08 '12 at 18:12
  • @DoctorJ: You really need to be clear about your requirements. I thought the whole point of this was that you *weren't* going to call `list` on the thing. What do you want it to do if you list it? Exhaust the sub-generators into sublists? – BrenBarn Oct 08 '12 at 18:25
  • You can't expect to be able to `list` the outer generator but get the elements from the inner generator, if you also want to avoid storing temporary data. If you want to avoid storing temp data, the data must be read in the inner generator, which means the outer generator can't know how many groups there will be, which means you can't `list()` the outer generator by itself. – BrenBarn Oct 08 '12 at 18:40
  • @BrenBarn: You're right `groupby`-based (memory-efficient) solutions require consuming the input iterable in order but `len(list(grouper([0,1,2], lambda x: x == 0)))` shouldn't raise `MemoryError`. It should produce number of groups as [solutions in my answer](http://stackoverflow.com/a/12789059/4279) – jfs Oct 08 '12 at 20:52
  • i.e., your solutions fails for [`[0,1,2]*3, isstart=lambda x: x == 0` test case from the question](http://ideone.com/GGctB). – jfs Oct 08 '12 at 21:18

So here's another version that tries to stitch together pairs of subgroups from groupby with chain. It's noticeably faster for the performance test given, but much, much slower when there are many small groups (say isstart = lambda x: x % 2 == 0). It cheats and buffers repeated headers (you could get around this with read-all-but-last iterator tricks). It's also a step backward in the elegance department, so I think I still prefer the original.

def group2(iterable, isstart=lambda x: x):
    groups = itertools.groupby(iterable, isstart)
    start, group = next(groups)
    if not start:                   # Deal with initial non-start group
        yield group
        _, group = next(groups)
    groups = (grp for _, grp in groups)
    while True:                     # group will always be start item(s) now      
        group = list(group)         
        for item in group[0:-1]:    # Back-to-back start items... and hope this doesn't get very big.  :)
            yield iter([item])      
        yield itertools.chain([group[-1]], next(groups, []))       # Start item plus subsequent non-start items
        group = next(groups)

%time deque(group2(xrange(10 ** 7), lambda x: x % 1000 == 0), maxlen=0)
CPU times: user 3.13 s, sys: 0.00 s, total: 3.13 s
Doctor J
  • [here's my entry](http://ideone.com/emXYf) for the byzantine solutions to a simple problem contest ;) – jfs Oct 10 '12 at 00:57