0

In a for loop, it is easy to track progress by doing:

total_num = 1000
for num, url in enumerate(urls):
    print '%s / %s' (num+1, total_num)
    # do something

How would I keep track of progress with using Pool?

# input data has 4M items
pool = Pool(parallel_workers)
pool.map(run_item, input_data)
David542
  • 104,438
  • 178
  • 489
  • 842
  • 1
    It all depends on what you mean by 'progress.' In the first case, you picked one particular notion: milestone progress (in this case the milestone was the iteration number). But what if some iterations are more expensive than others? Then, when you finish 50% of the iterations, you are not necessarily 50% done, so you might choose something different. One approach to this for parallel processing is to have the code to be run (`run_item`) log progress statistics to a database or file, and then spawn another job to periodically read from that log and print some description in the main process. – ely Feb 10 '15 at 20:51
  • In the simplest case, you could have each process create a temporary file based on the process ID. Then, instead of printing the number of iterations passes, it could write this to the file. Back in the main process, you can write code to read the last line of each process-specific file and print up some text that summarizes where each process is. Keep in mind that there could be issues with file locking and you might not always get perfectly up-to-date numbers (but still better than nothing); and you may incur extra time loss for the sake of the logging. The `logging` module is good for this. – ely Feb 10 '15 at 20:55

2 Answers2

1

Check out the answer to this question. I think this is what you want.

Python multiprocessing - tracking the process of pool.map operation

Essentially, you should use an iterated map or an asynchronous map.

Community
  • 1
  • 1
Mike McKerns
  • 33,715
  • 8
  • 119
  • 139
0

A very rudimentary an approximate way would be to have a global variable and then determine the size of that throughout progress. Here's an example:

global progress
progress = set()

def run_item(input_data):
    progress.add(url)
    print len(progress) * parallel_workers
David542
  • 104,438
  • 178
  • 489
  • 842