
I tried some multiprocessing examples, mainly from http://toastdriven.com/blog/2008/nov/11/brief-introduction-multiprocessing/, where I took the 'simple application' that uses multiprocessing to test URLs. When I run it (in Python 3.3, on Windows, in the PyCharm IDE) with some modifications and a lot of URLs, my script never stops, and I don't see why.

import httplib2
import sys
from multiprocessing import Lock, Process, Queue, current_process

def worker(work_queue, done_queue):
    for url in iter(work_queue.get, 'STOP'):
        try:
            print("In : %s - %s." % (current_process().name, url))
            status_code = print_site_status(url)
            done_queue.put("%s - %s got %s." % (current_process().name, url, status_code))
        except:
            done_queue.put("%s failed on %s with: %s" % (current_process().name, url, str(sys.exc_info()[0])))
    print("Out : %s " % (current_process().name))
    return True

def print_site_status(url):
    http = httplib2.Http(timeout=10)
    headers, content = http.request(url)
    return headers.get('status', 'no response')

def main():
    workers = 8
    work_queue = Queue()
    done_queue = Queue()
    processes = []
    with open("Annu.txt") as f: # file with URLs
        lines = f.read().splitlines()
    for surl in lines:
        work_queue.put(surl)

    for w in range(workers):
        p = Process(target=worker, args=(work_queue, done_queue))
        p.start()
        processes.append(p)
        work_queue.put('STOP')

    for p in processes:
        p.join()
    print("END")
    done_queue.put('STOP')

    for status in iter(done_queue.get, 'STOP'):
        print(status)

if __name__ == '__main__':
    main()

I do see the status of every URL tested, and all the 'Out' messages that indicate the end of each process, but never my 'END' message. The list of URLs I use is: http://www.pastebin.ca/2946850 .

So... where is my error? Is this a duplicate of Python multiprocessing threads never join when given large amounts of work?

One more piece of information: when I remove 'done_queue' everywhere in the code, it works.

philnext
  • Note that END will not be printed at the end as you think. Scroll up close to halfway and you will see it there. – dopstar Mar 02 '15 at 10:22
  • @dopstar Sorry, but your comment is not clear to me. Do you mean that the 'END' message is not at the end of the script's output? If so: I know, 'END' is only there to say that the processes have ended. – philnext Mar 02 '15 at 10:37
  • You are also printing the statuses from the done queue after you print END, so this places END roughly midway. You have 500+ URLs, so you will have 1k+ printouts with END somewhere in the middle. – dopstar Mar 02 '15 at 12:24
  • I have confirmed that this prints END somewhere in the middle. Check the output of your script after I ran it: http://paste.ubuntu.com/10501008/ . END is at line 553. – dopstar Mar 02 '15 at 12:36
  • @dopstar I saw your output (thanks for it!) and that result is fine by me: the 'END' message comes after the 'Out : Process-x' ones (and before the 'Process-x - http://xxxxx' ones). But in my case the 'Out : Process-x' messages are the last ones, with no 'END': I get roughly 500 prints, not 1000. Maybe a Windows configuration issue... – philnext Mar 02 '15 at 13:17

2 Answers


From the Queue documentation:

If optional args block is true and timeout is None (the default), block if necessary until an item is available.

That means your loops never terminate.

You either need to add a timeout to the get and stop the loop when you get the Empty exception, or you need to exit the loop when you get the STOP message.
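For illustration, the worker loop could look roughly like this (an untested sketch: the 5-second timeout is an arbitrary value, and `print_site_status` is the helper from your code):

import queue  # the Empty exception lives here, even for multiprocessing queues
from multiprocessing import current_process

def worker(work_queue, done_queue):
    while True:
        try:
            url = work_queue.get(timeout=5)  # wait at most 5 seconds for an item
        except queue.Empty:
            break                            # nothing arrived in time: leave the loop
        if url == 'STOP':
            break                            # sentinel received: leave the loop
        try:
            status_code = print_site_status(url)  # helper from the question
            done_queue.put("%s - %s got %s." % (current_process().name, url, status_code))
        except Exception as exc:
            done_queue.put("%s failed on %s with: %s" % (current_process().name, url, exc))
    print("Out : %s " % current_process().name)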

Aaron Digulla
  • Sure, a small number of URLs works. I use processes just to learn. I tested with some logging and it seems that all the URLs are processed, BUT the processes seem not to join. – philnext Mar 02 '15 at 10:33
  • Add `print` statements around the join to see which process doesn't come back. And how did you check that each URL is processed? The naked eye can't check 1000 URLs. – Aaron Digulla Mar 02 '15 at 15:12
  • I checked that all URLs were processed and that all the 'print("Out : %s " % (current_process().name))' lines ran. It seems that ALL the processes don't come back. – philnext Mar 02 '15 at 17:31
  • I think they are waiting for you to read their stdout. You use `print` in the processes. That means they write to stdout. – Aaron Digulla Mar 03 '15 at 11:09
  • I don't think so: I added the 'print' calls to trace things AFTER the problem occurred. – philnext Mar 03 '15 at 15:23
  • Do you see that output anywhere? – Aaron Digulla Mar 03 '15 at 15:24
  • Sure, all the prints from the processes are visible. – philnext Mar 03 '15 at 15:27
  • Or not ... `block` is `False` in your case, you should get an `Empty` exception. Why don't you see that one? – Aaron Digulla Mar 03 '15 at 15:29
  • Hey! Your edited answer changes the game. As I said, the 'queue' seemed to be the origin of the problem, and it seems you have a good answer to fix it. I'll test ASAP and come back. – philnext Mar 03 '15 at 15:37

OK, I found the answer (in the Python docs: https://docs.python.org/3.4/library/multiprocessing.html#multiprocessing-programming):

Warning: As mentioned above, if a child process has put items on a queue (and it has not used JoinableQueue.cancel_join_thread), then that process will not terminate until all buffered items have been flushed to the pipe.

So change this code:

    print("Out : %s " % (current_process().name))
    return True

to this:

    print("Out : %s " % (current_process().name))
    done_queue.cancel_join_thread()
    return True

I still don't understand why the initial code works with a small number of URLs, though...
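Another way to avoid the problem, based on the same doc section (it also says that all items put on a queue should be removed before the process that put them is joined), would be to keep the original worker and read done_queue before calling join(). A rough, untested sketch of main(), reusing worker() and print_site_status() from the question; since each URL produces exactly one message on done_queue, main() knows how many results to read:

from multiprocessing import Process, Queue

def main():
    workers = 8
    work_queue = Queue()
    done_queue = Queue()
    processes = []

    with open("Annu.txt") as f:  # file with URLs
        urls = f.read().splitlines()
    for url in urls:
        work_queue.put(url)

    for _ in range(workers):
        # worker() is the unchanged function from the question
        p = Process(target=worker, args=(work_queue, done_queue))
        p.start()
        processes.append(p)
        work_queue.put('STOP')

    # Drain done_queue BEFORE joining: one message per URL, so the count is known.
    for _ in range(len(urls)):
        print(done_queue.get())

    for p in processes:
        p.join()
    print("END")

With this ordering the queue is empty before join() is called, and 'END' really is printed last.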

philnext