
Python's multiprocessing.Pool.imap is very convenient for processing large files line by line:

import multiprocessing

def process(line):
    processor = Processor('some-big.model') # this takes time to load... and it happens again for every single line!
    return processor.process(line)

if __name__ == '__main__':
    pool = multiprocessing.Pool(4)
    with open('lines.txt') as infile, open('processed-lines.txt', 'w') as outfile:
        for processed_line in pool.imap(process, infile):
            outfile.write(processed_line)

How can I make sure that helpers such as Processor in the example above are loaded only once? Is this possible at all without resorting to a more complicated/verbose structure involving queues?

sam

  • It's possible, that's what the `Pool`-parameter [initializer](https://docs.python.org/3.7/library/multiprocessing.html#multiprocessing.pool.Pool) is made for ([example](https://stackoverflow.com/a/53621343/9059420)). – Darkonaut Jul 08 '19 at 16:25
  • Thanks for the hint, @Darkonaut! I've also found two [related](https://stackoverflow.com/questions/10117073/how-to-use-initializer-to-set-up-my-multiprocess-pool?answertab=votes#tab-top) [posts](https://stackoverflow.com/questions/38795826/optimizing-multiprocessing-pool-with-expensive-initialization?answertab=votes#tab-top) on Stackoverflow in the meantime, but I've literally wasted hours of googling on this one... – sam Jul 08 '19 at 18:08

1 Answer


multiprocessing.Pool allows for resource initialisation via its initializer and initargs parameters. I was surprised to learn that the idea is to make use of global variables, as illustrated below:

import multiprocessing as mp

def init_process(model):
    global processor
    processor = Processor(model) # this takes time to load...

def process(line):
    return processor.process(line) # via global variable `processor` defined in `init_process`

if __name__ == '__main__':
    pool = mp.Pool(4, initializer=init_process, initargs=('some-big.model',))
    with open('lines.txt') as infile, open('processed-lines.txt', 'w') as outfile:
        for processed_line in pool.imap(process, infile):
            outfile.write(processed_line)

The concept isn't very well described in multiprocessing.Pool's documentation, so I hope this example will be helpful to others.

sam