I'm parsing a file like this:
--header--
data1
data2
--header--
data3
data4
data5
--header--
--header--
...
And I want groups like this:
[ [header, data1, data2], [header, data3, data4, data5], [header], [header], ... ]
so I can iterate over them like this:
for grp in group(open('file.txt'), lambda line: 'header' in line):
    for item in grp:
        process(item)
and keep the detect-a-group logic separate from the process-a-group logic.
But I need an iterable of iterables, as the groups can be arbitrarily large and I don't want to store them. That is, I want to split an iterable into subgroups every time I encounter a "sentinel" or "header" item, as indicated by a predicate. Seems like this would be a common task, but I can't find an efficient Pythonic implementation.
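To make the laziness requirement concrete, this is the kind of streaming consumption I'd like each group to support (hypothetical usage of the same group() call as above; it assumes each grp is itself an iterator):

# Hypothetical usage: reduce each group on the fly, never holding a whole group in memory.
for grp in group(open('file.txt'), lambda line: 'header' in line):
    header = next(grp)            # the sentinel line (the first group may start with data instead)
    n_data = sum(1 for _ in grp)  # e.g. count the remaining data lines in this group
    print header.strip(), n_data  # Python 2 print, to match the rest of the question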
Here's the dumb append-to-a-list implementation:
def group(iterable, isstart=lambda x: x):
    """Group `iterable` into groups starting with items where `isstart(item)` is true.

    Start items are included in the group. The first group may or may not have a
    start item. An empty `iterable` results in an empty result (zero groups).
    """
    items = []
    for item in iterable:
        if isstart(item) and items:
            yield iter(items)
            items = []
        items.append(item)
    if items:
        yield iter(items)
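As a quick sanity check of the intended semantics, using the header layout from the top of the question:

lines = ['--header--', 'data1', 'data2',
         '--header--', 'data3', 'data4', 'data5',
         '--header--',
         '--header--']
print [list(grp) for grp in group(lines, lambda line: 'header' in line)]
# [['--header--', 'data1', 'data2'],
#  ['--header--', 'data3', 'data4', 'data5'],
#  ['--header--'],
#  ['--header--']]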
It feels like there's got to be a nice itertools version, but it eludes me. The 'obvious' (?!) groupby solution doesn't seem to work because there can be adjacent headers, and they need to go in separate groups. The best I can come up with is (ab)using groupby with a key function that keeps a counter:
import itertools

def igroup(iterable, isstart=lambda x: x):
    def keyfunc(item):
        if isstart(item):
            keyfunc.groupnum += 1  # Python 2's closures leave something to be desired
        return keyfunc.groupnum
    keyfunc.groupnum = 0
    return (group for _, group in itertools.groupby(iterable, keyfunc))
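For what it's worth, the same counter trick can be written without the function-attribute hack by closing over an itertools.count and a one-element list (or a nonlocal integer in Python 3). This is just a restatement of igroup, not a faster solution:

import itertools

def igroup2(iterable, isstart=lambda x: x):
    # Same groupby-with-a-counter idea as igroup, but the key function
    # advances a shared itertools.count instead of mutating an attribute.
    counter = itertools.count()
    current = [next(counter)]  # mutable cell, since Python 2 has no nonlocal
    def keyfunc(item):
        if isstart(item):
            current[0] = next(counter)
        return current[0]
    return (grp for _, grp in itertools.groupby(iterable, keyfunc))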
But I feel like Python can do better -- and sadly, this is even slower than the dumb list version:
# ipython
%time deque(group(xrange(10 ** 7), lambda x: x % 1000 == 0), maxlen=0)
CPU times: user 4.20 s, sys: 0.03 s, total: 4.23 s
%time deque(igroup(xrange(10 ** 7), lambda x: x % 1000 == 0), maxlen=0)
CPU times: user 5.45 s, sys: 0.01 s, total: 5.46 s
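For reference, roughly the standalone equivalent of those %time runs (assuming group and igroup are defined in the module being executed; absolute numbers will of course vary by machine):

import timeit

# Exhaust each generator-of-generators with a zero-length deque, as in the %time calls above.
for fn in ('group', 'igroup'):
    stmt = "deque(%s(xrange(10 ** 7), lambda x: x %% 1000 == 0), maxlen=0)" % fn
    setup = "from __main__ import %s; from collections import deque" % fn
    print fn, timeit.timeit(stmt, setup=setup, number=1)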
To make it easy on you, here's some unit test code:
import itertools
import random
import unittest

class Test(unittest.TestCase):
    def test_group(self):
        MAXINT, MAXLEN, NUMTRIALS = 100, 100000, 21
        isstart = lambda x: x == 0
        self.assertEqual(next(igroup([], isstart), None), None)
        self.assertEqual([list(grp) for grp in igroup([0] * 3, isstart)], [[0]] * 3)
        self.assertEqual([list(grp) for grp in igroup([1] * 3, isstart)], [[1] * 3])
        self.assertEqual(len(list(igroup([0, 1, 2] * 3, isstart))), 3)  # Catch hangs when groups are not consumed
        for _ in xrange(NUMTRIALS):
            expected, items = itertools.tee(itertools.starmap(
                random.randint,
                itertools.repeat((0, MAXINT), random.randint(0, MAXLEN))))
            for grpnum, grp in enumerate(igroup(items, isstart)):
                start = next(grp)
                self.assertTrue(isstart(start) or grpnum == 0)
                self.assertEqual(start, next(expected))
                for item in grp:
                    self.assertFalse(isstart(item))
                    self.assertEqual(item, next(expected))
So: how can I subgroup an iterable by a predicate elegantly and efficiently in Python?