
I've been playing with multiprocessing for a bit now and there is something that confuses me. I wrote this simple code to illustrate the problem:

from multiprocessing.pool import ThreadPool #I import the packages needed
from time import sleep

def long_task(n): #a simple long task
    sleep(1)
    print str(n)+" task finished"


pool = ThreadPool(8) #define my threadpool

for x in xrange(10**7): #it could be a while loop too
    print x
    pool.apply_async(long_task, args=(x,))

Inside the for loop, I expect my code to wait until one of the 8 threads has finished before starting another one, but x is being printed without any break. Why is this happening? How do I get the behavior I'm looking for? And is this code optimized?

Sample output:

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
Liam
  • Please provide a sample output. `print x` should run immediately 10**7 times. `print str(n)+" task finished"` will run with one second delay between each batch of 8. – Mad Physicist Apr 26 '17 at 16:14
  • Yes exactly. Also at the end of that count to 10**7-1, there should be some printouts of the form that shows in my answer. – Mad Physicist Apr 26 '17 at 16:29

1 Answer

Part of the confusion comes from attempting to start 10**7 tasks. For the sake of the experiment, reduce this to a sensible number, say 30. Your output will now be

0
1
2
...
27
28
29

Then, approximately one second later, something like

2 task finished3 task finished
0 task finished1 task finished


5 task finished4 task finished6 task finished


7 task finished

The text will be scrambled, and in my case the newlines were usually printed in batches. This happens because calls to print from different threads are not synchronized with each other. The next batch will print approximately a second later:

13 task finished
11 task finished9 task finished8 task finished12 task finished
10 task finished

The third batch looks similar. The last batch will only contain the last 6 outputs (24-29):

24 task finished
25 task finished
26 task finished
29 task finished27 task finished

28 task finished
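
If the scrambled lines above get in the way, one possible fix is to guard the print call with a lock shared by all worker threads. This is only a minimal sketch, written for Python 3 (the question uses Python 2 syntax). The names `print_lock` and `run` are made up for illustration, and the sleep is shortened so the demo finishes quickly.

```python
from multiprocessing.pool import ThreadPool
from threading import Lock
from time import sleep

print_lock = Lock()  # hypothetical name; one lock shared by every worker thread


def long_task(n):
    sleep(0.05)  # shortened from the question's 1 second
    # Only one thread at a time may print, so the text and its
    # trailing newline are always written together.
    with print_lock:
        print(str(n) + " task finished")
    return n


def run(count, workers=8):
    """Run `count` tasks on a pool of `workers` threads and return their results."""
    with ThreadPool(workers) as pool:
        return pool.map(long_task, range(count))


run(30)
```

The lines can still arrive in any order within a batch. The lock only keeps each line intact.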

The thing to remember is that tasks are scheduled immediately; that is the purpose of the thread pool. apply_async just adds each task to a queue of things to run later and returns without waiting, which is why you see the printout of x immediately. The tasks themselves are run eight at a time, as you would expect. Strictly speaking, the tasks after the first batch are started one by one as threads become available, but since they all take almost exactly the same amount of time, it looks as though they run in batches.
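
To see the difference between queuing a task and running it, here is a rough sketch for Python 3 that times the submitting loop in two ways. `submit_unbounded` mirrors the question's loop. `submit_bounded` is one possible way (not the only one) to make the loop wait for a free thread, using a semaphore that is released from the `callback` and `error_callback` of `apply_async`. The function names, `WORKERS`, and `TASK_SECONDS` are assumptions made for the demo.

```python
from multiprocessing.pool import ThreadPool
from threading import BoundedSemaphore
from time import sleep, perf_counter

WORKERS = 4          # hypothetical pool size for the demo
TASK_SECONDS = 0.05  # shortened from the question's 1 second


def long_task(n):
    sleep(TASK_SECONDS)
    return n


def submit_unbounded(count):
    """Queue every task at once, as in the question; time the submitting loop."""
    with ThreadPool(WORKERS) as pool:
        start = perf_counter()
        pending = [pool.apply_async(long_task, args=(x,)) for x in range(count)]
        submit_time = perf_counter() - start
        values = [p.get() for p in pending]  # wait for all tasks to finish
    return submit_time, values


def submit_bounded(count):
    """Make the submitting loop block until a thread has finished a task."""
    slots = BoundedSemaphore(WORKERS)  # one slot per worker thread

    def release(_):
        slots.release()  # called by the pool once a task succeeds or fails

    with ThreadPool(WORKERS) as pool:
        start = perf_counter()
        pending = []
        for x in range(count):
            slots.acquire()  # blocks while all WORKERS slots are taken
            pending.append(pool.apply_async(
                long_task, args=(x,), callback=release, error_callback=release))
        submit_time = perf_counter() - start
        values = [p.get() for p in pending]
    return submit_time, values


unbounded_time, _ = submit_unbounded(12)
bounded_time, _ = submit_bounded(12)
print("unbounded submit loop: %.3fs" % unbounded_time)
print("bounded submit loop:   %.3fs" % bounded_time)
```

The unbounded loop should return almost at once, while the bounded loop takes roughly as long as the tasks it has to wait for. With 10**7 tasks, the bounded version also avoids holding millions of queued tasks in memory at the same time.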

You can set up an experiment to see what happens when half of your tasks take 1 second to run and half take 2 seconds. The tasks are still picked up in the order you add them to the queue, but the threads running the 1-second tasks become available twice as often as the ones running the 2-second tasks.
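
A minimal sketch of that experiment, written for Python 3, could look like the following. The durations are scaled down (0.1 s and 0.2 s stand in for 1 and 2 seconds) and the pool has 4 threads instead of 8 so that it finishes quickly. `timed_task`, `run_experiment`, and the constants are hypothetical names chosen for this example.

```python
from multiprocessing.pool import ThreadPool
from time import sleep, monotonic

WORKERS = 4  # smaller than the question's 8, to keep the demo short
SHORT = 0.1  # stands in for the 1-second tasks
LONG = 0.2   # stands in for the 2-second tasks


def timed_task(n, base):
    """Sleep for SHORT (even n) or LONG (odd n) and report when the task ran."""
    started = monotonic() - base
    sleep(SHORT if n % 2 == 0 else LONG)
    finished = monotonic() - base
    return n, started, finished


def run_experiment(count):
    """Queue `count` tasks at once and collect (n, started, finished) for each."""
    base = monotonic()
    with ThreadPool(WORKERS) as pool:
        pending = [pool.apply_async(timed_task, args=(n, base))
                   for n in range(count)]
        return [p.get() for p in pending]


for n, started, finished in run_experiment(8):
    print("task %d: started %.2fs, finished %.2fs" % (n, started, finished))
```

The start times of the later tasks show them being picked up as soon as any thread finishes, rather than in fixed batches.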

Mad Physicist
  • Shouldn't be. If you are only running 8 at a time, you should have no problem creating arrays for the inputs and outputs. – Mad Physicist Apr 26 '17 at 16:35
  • Just to add, shouldn't you be using a `ProcessPool` instead of a `ThreadPool`, seeing as this task isn't I/O intensive? Granted, this might not be what you plan to use this for. – gold_cy Apr 26 '17 at 16:46
  • @Liam [this](http://stackoverflow.com/questions/1058523/whats-the-difference-between-using-the-thread-pool-and-a-normal-thread) post explains when to use `ThreadPool` in greater detail – gold_cy Apr 26 '17 at 16:53