
Problem

I'm using Python's multiprocessing module to execute functions asynchronously. What I want is to track the overall progress of my script as each process calls and executes add_print. For instance, I would like the code below to add 1 to total and print the value (1 2 3 ... 18 19 20) every time a process runs that function. My first attempt was a global variable, but this didn't work: because the function runs in separate worker processes, each process reads total as 0 to start off and adds 1 independently of the others, so the output is twenty 1s instead of incrementing values.

How could I go about referencing the same block of memory from my mapped function in a synchronized manner, even though the function is being run asynchronously? One idea I had was to somehow cache total in memory and then reference that exact block of memory when I add to total. Is this a possible and fundamentally sound approach in Python?

Please let me know if you need any more info or if I didn't explain something well enough.

Thanks!


Code

#!/usr/bin/python

## Standard library imports
from multiprocessing import Pool 

total = 0

def add_print(num):
    global total
    total += 1
    print(total)


if __name__ == "__main__":
    nums = range(20)

    pool = Pool(processes=20)
    pool.map(add_print, nums)
Austin A
    Doing stuff like this (setting a global variable for progress etc.) forces a synchronisation, so it might limit your performance. Not very likely if you have a large enough work unit and update the variable rarely. Doing it in a tight loop might cripple you though. – schlenk Aug 03 '15 at 01:10

1 Answer


You could use a shared Value:

import multiprocessing as mp

def add_print(num):
    """
    https://eli.thegreenplace.net/2012/01/04/shared-counter-with-pythons-multiprocessing
    """
    # Hold the lock while both updating and reading the shared counter,
    # so another worker can't change it between the two operations.
    with lock:
        total.value += 1
        print(total.value)

def setup(t, l):
    # Runs once in each worker process: expose the shared counter and lock
    # as globals so add_print can use them.
    global total, lock
    total = t
    lock = l

if __name__ == "__main__":
    total = mp.Value('i', 0)   # shared integer counter
    lock = mp.Lock()           # guards updates to total
    nums = range(20)
    pool = mp.Pool(initializer=setup, initargs=[total, lock])
    pool.map(add_print, nums)

The Pool's initializer calls setup once in each worker subprocess. setup binds total and lock as global variables in that worker, so they can be accessed inside add_print when the worker calls it.
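
As an aside, a Value created with mp.Value() already carries its own internal lock, so the separate mp.Lock() isn't strictly required. Here is a minimal sketch of the same counter using that built-in lock via get_lock(); same approach, just one shared object passed to the initializer:

import multiprocessing as mp

def add_print(num):
    # mp.Value() objects are created with an internal lock by default;
    # get_lock() returns it.
    with total.get_lock():
        total.value += 1
        print(total.value)

def setup(t):
    global total
    total = t

if __name__ == "__main__":
    total = mp.Value('i', 0)
    pool = mp.Pool(initializer=setup, initargs=[total])
    pool.map(add_print, range(20))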

Note that the number of processes should not exceed the number of CPUs your machine has; any excess subprocesses will just wait around for a CPU to become available. So don't use processes=20 unless you have 20 or more CPUs. If you don't supply a processes argument, multiprocessing detects the number of CPUs available and spawns a pool with that many workers for you. The number of tasks (e.g. the length of nums) usually greatly exceeds the number of CPUs; that's fine, since the tasks are queued and each one is processed by a worker as it becomes available.
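
For example, a small sketch of that default behavior (square is just a stand-in task):

import multiprocessing as mp

def square(n):
    return n * n

if __name__ == "__main__":
    print("CPUs detected:", mp.cpu_count())
    with mp.Pool() as pool:                    # no processes argument: one worker per CPU
        results = pool.map(square, range(40))  # 40 tasks are queued among the workers
    print(results[:5])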

unutbu
    this is exactly what I was looking for. I would like to say that I missed this in the docs but it never even dawned on me that the module handled shared state. Also, good note on the processor setting! Couldn't be happier with this answer! – Austin A Aug 03 '15 at 02:08
  • after playing around with your answer a little bit, I notice a distinct difference in the behavior between setting `processes=8` and `processes=20`. From your explanation, it sounds like these two settings should result in the same actions. Is this not the case? I have 8 CPU's by the way. – Austin A Aug 03 '15 at 03:53
  • If you are using Windows, [this might explain it](http://stackoverflow.com/q/11151727/190597). – unutbu Aug 03 '15 at 10:35
  • I should have mentioned that this will be used on both OSX and Ubuntu. – Austin A Aug 03 '15 at 16:05
  • Wouldn't you face quite easily race conditions here? – ImanolUr Mar 05 '19 at 16:13
  • @ImanolUr: You are absolutely right. Thanks for the correction. – unutbu Mar 05 '19 at 16:42