I have a very long input stream that I read line by line with a generator. Organizing the data in batches greatly improves the processing rate. The reading loop looks approximately like this:

import itertools

# Create tuple stream generator (an example, not the real code)
input_gen = ((vals[0], vals[0]) for block in input_file for vals in block.split())

while True:
  batch = tuple(itertools.islice(input_gen, 42)) # 42 is the batch size
  if len(batch) == 0:
    break
  # process batch

The while-if construction looks cumbersome. Is it possible to organize the code as a simple for loop?

For example:

for batch in <some_expression>:
  # process batch
Anton K

1 Answer

My approach would be to wrap your generator in a "chunking" function:

from itertools import islice

def chunk(it, size):
    it = iter(it)
    return iter(lambda: tuple(islice(it, size)), ())

>>> from itertools import count
>>> c = chunk(count(666), 13)
>>> next(c)
(666, 667, 668, 669, 670, 671, 672, 673, 674, 675, 676, 677, 678)
>>> next(c)
(679, 680, 681, 682, 683, 684, 685, 686, 687, 688, 689, 690, 691)
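
In case the one-liner looks cryptic: it relies on the two-argument form of iter(callable, sentinel), which keeps calling the callable until it returns a value equal to the sentinel (here, the empty tuple). A spelled-out sketch that should behave the same way is below; the name chunk_explicit is made up for illustration only.

```python
from itertools import islice

def chunk_explicit(it, size):
    """Yield tuples of up to `size` items (hypothetical spelled-out variant of chunk)."""
    it = iter(it)  # ensure successive islice() calls consume the same iterator
    while True:
        batch = tuple(islice(it, size))
        if not batch:  # empty tuple: the source is exhausted
            return
        yield batch
```

This is essentially the while-if loop from the question, moved into a reusable generator.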

Since chunk returns an iterator, using for with it is the most natural thing to do:

>>> for c in chunk(count(666), 13):
...     s = sum(c)
...     print(s)
...     if s > 10_000:
...         break
...
8736
8905
9074
9243
9412
9581
9750
9919
10088

I had to break here because otherwise the loop would run forever (count() returns an infinite iterator). It works as expected with finite iterators, too:

>>> for c in chunk(range(69), 13):
...     print(sum(c))
...
78
247
416
585
754
266
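
Applied to the loop from the question, it might look like the sketch below. The sample input lines and the batch_sizes bookkeeping are assumptions made up for illustration; they stand in for the real file and the real processing step.

```python
from itertools import islice

def chunk(it, size):
    it = iter(it)
    return iter(lambda: tuple(islice(it, size)), ())

# Hypothetical stand-in for the real input file: any iterable of text lines works.
input_file = ["a b c", "d e", "f g h i"]

# Same shape as the generator in the question.
input_gen = ((vals[0], vals[0]) for block in input_file for vals in block.split())

batch_sizes = []
for batch in chunk(input_gen, 4):   # 4 instead of 42 to keep the example short
    batch_sizes.append(len(batch))  # placeholder for the real processing

print(batch_sizes)
# [4, 4, 1]
```

The loop ends by itself once the generator is exhausted, so no explicit length check or break is needed.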

This is taken from here; you can also read there about other approaches to "chunking" and about padding issues.
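
On the padding issue: chunk() yields a shorter final tuple when the input does not divide evenly (the last chunk of range(69) above has only 4 items). If every chunk must have the same length, one common alternative is the "grouper" idiom built on itertools.zip_longest. A minimal sketch, where chunk_padded and fill are names chosen for this example:

```python
from itertools import zip_longest

def chunk_padded(it, size, fill=None):
    """Yield tuples of exactly `size` items, padding the last one with `fill`."""
    # `size` references to the same iterator, so zip_longest pulls consecutive items
    args = [iter(it)] * size
    return zip_longest(*args, fillvalue=fill)

print(list(chunk_padded(range(5), 2)))
# [(0, 1), (2, 3), (4, None)]
```

Note that the fill value has to be something that cannot occur in the real data, otherwise padding is indistinguishable from input.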

Nikolaj Š.
  • Your current example doesn't demonstrate that it ends well. I suggest you use a short `range` instead of `count`+`break`. – Kelly Bundy Jun 12 '22 at 16:24
  • Good point, @KellyBundy, I didn't think of that. My usual concern with similar solutions is whether it works with (potentially) infinite iterators, because lots of people like to slice or `len()` or `list()` them. I've added another example. – Nikolaj Š. Jun 12 '22 at 18:10