8

Suppose I have a generator whose `__next__()` function is somewhat expensive and I want to try to parallelize the calls. Where do I throw in the parallelization?

To be slightly more concrete, consider this example:

# fast, splitting a file for example
raw_blocks = (b for b in block_generator(fin))
# slow, reading blocks, checking values ...
parsed_blocks = (block_parser(b) for b in raw_blocks)
# get all parsed blocks into a data structure
data = parsedBlocksToOrderedDict(parsed_blocks)

The most basic approach is to change the second line to something that does the parallelization. Is there some generator magic that allows one to unpack the generator on the third line in parallel? That is, calling `__next__()` in parallel?

mathtick
  • Generally, I would refrain from this. The guts of the internals may be very stateful and thread-unsafe. Consider improving the generator itself instead (assuming it's not just a simple generator expression, but even then you need some thread safety in the involved code to do this). – Nov 01 '11 at 20:15
  • I think you've mentioned the solution in your answer. Parallelize the calls to `block_parser`. – agf Nov 01 '11 at 20:16
  • You may want to split your generator into multiple ones (if possible), starting each one at a pre-calculated starting point. This way you might get better performance. – Traxidus Wolf Apr 26 '19 at 19:06
  • I'm currently working on this. I have a generator that sends HTTP requests or processes images in `__next__()`. What I did was to decouple the code in `__next__()` into two parts: the first part generates something like metadata, e.g. an image filename, and the second part does the expensive work. I implemented a wrapper that takes in a cheap generator and a decoding function that does the heavy single-process task. It parallelizes the tasks by creating a worker pool and keeps submitting tasks to it. Feel free to use my code, but do not use the version on pip; it's extremely unstable, and may so – slab chan May 13 '19 at 08:07
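
A minimal sketch of the decoupling idea described in the last comment, assuming a hypothetical cheap generator of metadata and an expensive decoding function; the names cheap_metadata, decode, and filelist.txt are illustrative, not from the original code:

import multiprocessing as mp

def cheap_metadata(fin):
    # Cheap part of the old __next__(): yield only lightweight items,
    # e.g. filenames or offsets.
    for line in fin:
        yield line.strip()

def decode(name):
    # Expensive part: stands in for an HTTP request or image decode.
    return name.upper()  # placeholder for the real heavy work

if __name__ == "__main__":
    with open("filelist.txt") as fin, mp.Pool() as pool:
        # Only decode() runs in the worker processes; the cheap generator
        # is still consumed sequentially in the parent process.
        for result in pool.imap(decode, cheap_metadata(fin)):
            print(result)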

2 Answers

7

Assuming the calls to block_parser(b) are what you want to perform in parallel, you could try using a multiprocessing.Pool:

import multiprocessing as mp

pool = mp.Pool()

raw_blocks = block_generator(fin)
# imap applies block_parser to each block in the worker processes,
# yielding the results lazily and in order
parsed_blocks = pool.imap(block_parser, raw_blocks)
data = parsedBlocksToOrderedDict(parsed_blocks)

Note that:

  • If you expect that list(parsed_blocks) can fit entirely in memory, then using pool.map can be much faster than pool.imap (see the sketch after these notes).
  • The items in raw_blocks and the return values from block_parser must be picklable, since mp.Pool transfers tasks and results through an mp.Queue.
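
For instance, here is a runnable variant of the same pipeline using pool.map and the context-manager form of the pool; the body of block_parser and the stand-ins for block_generator(fin) and parsedBlocksToOrderedDict below are placeholders, not code from the question:

import multiprocessing as mp

def block_parser(b):
    # Stand-in for the expensive per-block work.
    return b * 2

if __name__ == "__main__":
    raw_blocks = (i for i in range(1000))  # stand-in for block_generator(fin)
    with mp.Pool() as pool:
        # map() collects the whole result list in memory before returning,
        # which is often faster than imap() when the results fit in RAM
        parsed_blocks = pool.map(block_parser, raw_blocks)
    data = dict(enumerate(parsed_blocks))  # stand-in for parsedBlocksToOrderedDict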
unutbu
  • Yeah, I will use mp.Pool for sure. I will mark the other answer as "correct" since it answers the conceptual question I was asking about generators but this is a good solution you provided. – mathtick Nov 01 '11 at 20:26
  • Isn't `raw_blocks` here the same generator as `block_generator(fin)`? What is the advantage of creating this new additional but equal generator? – Max Power Jun 06 '19 at 16:11
  • @MaxPower: Thanks for the improvement. There is no need for the generator expression here. – unutbu Jun 06 '19 at 16:40
  • @unutbu And how should things like `parsedBlocksToOrderedDict` be done? I mean, converting a for loop into a dict? – user2284570 Jun 06 '20 at 08:08
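
One possible shape for parsedBlocksToOrderedDict, in reply to the last comment; this is only an illustrative sketch that assumes each parsed block is a (key, value) pair, which the question does not actually state:

from collections import OrderedDict

def parsedBlocksToOrderedDict(parsed_blocks):
    # Consume the (lazy) imap iterator in order; each parsed block is
    # assumed here to be a (key, value) pair.
    return OrderedDict(parsed_blocks)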
6

No. You must call next() sequentially because any non-trivial generator's next state is determined by its current state.

def gen(num):
    j = 0
    for i in range(num):   # xrange in the original Python 2 example
        j += i             # each yielded value depends on all previous ones
        yield j

There's no way to parallelize calls to the above generator without knowing its state at each point it yields a value. But if you knew that, you wouldn't need to run it.

kindall
  • Thanks ... that's what I had guessed, but I wasn't sure if there was a way to do something with "trivial" generators, i.e. "stationary" generators that do not have a truly dependent `__next__()`. – mathtick Nov 01 '11 at 20:25
  • Trivial ones that iterate over a list could be parallelizable (really, you'd break up the list and have a thread iterate over each piece) *but* those are not the kind of generators that take so much time you'd want to parallelize them. – kindall Nov 01 '11 at 20:52
  • Trivial here means independent, not 'fast'. Maybe the trick is just not to use generators, since I do not need or want the concept of a 'state' ... I just need an index to jobs and args (see the sketch after these comments). – mathtick Nov 01 '11 at 21:10
  • By "trivial" I really meant a generator that always yields the same value and doesn't maintain any state whatsoever. – kindall Jun 14 '16 at 16:16