
I wanted to use threading in Python to download a lot of web pages, and I came across the following code, which uses queues, on one of those websites.

It puts an infinite while loop in each thread. Does each thread run continuously, without ending, until all of the jobs are complete? Am I missing something?

#!/usr/bin/env python
import Queue
import threading
import urllib2
import time

hosts = ["http://yahoo.com", "http://google.com", "http://amazon.com",
"http://ibm.com", "http://apple.com"]

queue = Queue.Queue()

class ThreadUrl(threading.Thread):
  """Threaded Url Grab"""
  def __init__(self, queue):
    threading.Thread.__init__(self)
    self.queue = queue

  def run(self):
    while True:
      #grabs host from queue
      host = self.queue.get()

      #grabs urls of hosts and prints first 1024 bytes of page
      url = urllib2.urlopen(host)
      print url.read(1024)

      #signals to queue job is done
      self.queue.task_done()

start = time.time()
def main():

  #spawn a pool of threads, and pass them queue instance 
  for i in range(5):
    t = ThreadUrl(queue)
    t.setDaemon(True)
    t.start()

  #populate queue with data   
  for host in hosts:
    queue.put(host)

  #wait on the queue until everything has been processed     
  queue.join()

main()
print "Elapsed Time: %s" % (time.time() - start)
raju
  • Please check your indentation. I've tried to correct it, but the way `queue.join` was misplaced, it could also have been on the top level. Also, the loop where you add the hosts to the queue is within the loop where you create the threads, so you add every host five times. – mata Nov 20 '12 at 20:28
  • With the indentation correction, the script works fine for me too. – lukedays Nov 20 '12 at 20:54
  • This code looks like it's copied with a few modifications from [here](http://www.ibm.com/developerworks/aix/library/au-threadingpython/) – daviewales Sep 06 '13 at 15:50

3 Answers


Setting the threads to be daemon threads causes them to exit when the main thread is done. But yes, you are correct in that your threads will run continuously for as long as there is something in the queue; otherwise they will block.

The documentation explains this detail; see the Queue docs.

The Python threading documentation explains the daemon part as well:

The entire Python program exits when no alive non-daemon threads are left.

So, when the queue is emptied and queue.join resumes, the main thread finishes, the interpreter exits, and the daemon threads then die.

EDIT: Correction on default behavior for Queue
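If you would rather not rely on daemon threads being killed at interpreter exit, a common alternative is to have the workers stop themselves. This is only a minimal sketch (not from the original code; the SENTINEL marker, the two hosts, and the five-worker count are illustrative), written for Python 2 to match the question:

import Queue
import threading

queue = Queue.Queue()
SENTINEL = None   # hypothetical "no more work" marker

def worker():
    while True:
        host = queue.get()
        if host is SENTINEL:     # stop signal: leave the loop cleanly
            queue.task_done()
            break
        # ... fetch and process the page here ...
        queue.task_done()

threads = [threading.Thread(target=worker) for _ in range(5)]
for t in threads:
    t.start()

for host in ["http://yahoo.com", "http://google.com"]:
    queue.put(host)
for _ in threads:                # one sentinel per worker
    queue.put(SENTINEL)

queue.join()                     # all real work and sentinels processed
for t in threads:
    t.join()                     # every worker has exited its loop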

sean
  • The default behavior of `get` is to *block* if the queue is empty, not raise an `Empty` exception. – Warren Weckesser Nov 20 '12 at 20:39
  • Shouldn't the behaviour get corrected if we sleep the thread for 100 milliseconds or so whenever it cannot find any item in the queue? I am planning to spawn 50 threads in order to download more than 5000 pages. I don't think I can have all 50 threads fighting for CPU resources. – raju Nov 21 '12 at 04:15
  • No, by continuously I mean they will be in the infinite loop the entire time. You should not have to add a sleep into the loop to free up CPU time. There will be a block at the `queue.get` and during the http request. This block will function as the point where a context switch will happen to the next thread. Fifty threads should not be an issue for your purpose, as you will ultimately be limited by your download speed and whatever disk I/O you will be performing. – sean Nov 21 '12 at 04:49
  • I am worried about the cases when all 49 threads have completed their job and waiting for the 50th thread to complete the last http request. At that time wouldn't all 49 threads hog the cpu, not letting anything else to happen? – raju Nov 21 '12 at 07:16
  • No, the default action of the `queue.get` is to block, as was corrected by @WarrenWeckesser. So the 49 other threads would block, thus not hogging the CPU. – sean Nov 21 '12 at 14:01
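As the comments above describe, queue.get blocks by default rather than busy-waiting. A toy illustration of the difference between the blocking and non-blocking forms (my own sketch, not from the answer; the example URL is arbitrary), again in Python 2:

import Queue

q = Queue.Queue()

try:
    q.get_nowait()               # non-blocking get raises immediately when empty
except Queue.Empty:
    print "queue is empty"

q.put("http://example.com")
print q.get()                    # an item is available, so this returns at once
# q.get() on an empty queue simply blocks (the thread sleeps on a condition
# variable) until another thread calls put() -- no CPU spinning.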

Your script works fine for me, so I assume you are asking what is going on so you can understand it better. Yes, your subclass puts each thread in an infinite loop, waiting on something to be put in the queue. When something is found, it grabs it and does its thing. Then, the critical part, it notifies the queue that it's done with queue.task_done, and resumes waiting for another item in the queue.

While all this is going on with the worker threads, the main thread is waiting (join) until all the tasks in the queue are done, which will be when the threads have called queue.task_done the same number of times as there were messages in the queue. At that point the main thread finishes and exits. Since these are daemon threads, they close down too.
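To make that counting concrete, here is a toy sketch (my own illustration, not part of the answer; the three items and the sleep are stand-ins) of how join only returns after task_done has been called once for every item that was put:

import Queue
import threading
import time

q = Queue.Queue()
for i in range(3):
    q.put(i)                     # three tasks go in

def worker():
    while True:
        item = q.get()           # blocks until an item is available
        time.sleep(0.1)          # stand-in for real work
        q.task_done()            # one task_done() per completed get()

t = threading.Thread(target=worker)
t.setDaemon(True)
t.start()

q.join()                         # returns only after task_done() has been
                                 # called once for every put()
print "all three tasks are done"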

This is cool stuff, threads and queues. It's one of the really good parts of Python. You will hear all kinds of stuff about how threading in Python is screwed up by the GIL and such. But if you know where to use threads (like in this case, with network I/O), they will really speed things up for you. The general rule is: if you are I/O bound, try and test threads; if you are CPU bound, threads are probably not a good idea, so maybe try processes instead.
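For the CPU-bound case, the usual escape hatch is a process pool. A rough sketch only (the cpu_heavy function, pool size, and inputs are made up for illustration; Python 2 syntax to match the rest of the page):

import multiprocessing

def cpu_heavy(n):
    #pure computation: threads would serialize on the GIL here
    return sum(i * i for i in xrange(n))

if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=4)       # e.g. one worker per core
    results = pool.map(cpu_heavy, [10 ** 6] * 8)   # work runs across the processes
    pool.close()
    pool.join()
    print results[0]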

Good luck,

Mike

MikeHunter

I don't think Queue is necessary in this case. Using only Thread:

import threading, urllib2, time

hosts = ["http://yahoo.com", "http://google.com", "http://amazon.com",
"http://ibm.com", "http://apple.com"]

class ThreadUrl(threading.Thread):
    """Threaded Url Grab"""
    def __init__(self, host):
        threading.Thread.__init__(self)
        self.host = host

    def run(self):
        #grabs urls of hosts and prints first 1024 bytes of page
        url = urllib2.urlopen(self.host)
        print url.read(1024)

start = time.time()
def main():
    #spawn one thread per host, then wait for them all to finish
    threads = []
    for host in hosts:
        t = ThreadUrl(host)
        t.start()
        threads.append(t)
    for t in threads:
        t.join()

main()
print "Elapsed Time: %s" % (time.time() - start)
lukedays
  • Not a very good plan if a future version wants to load 10,000 URLs. – Zan Lynx Nov 20 '12 at 20:31
  • Yeah, I wouldn't create 10000 threads. Instead, I would create just a few threads, each handling multiple URL fetches. – lukedays Nov 20 '12 at 20:42
  • Not only that, if different threads are writing to stdout it's possible the output will be interleaved in an undesirable way. It's always simpler to queue results for processing by a single output thread. See [this question](https://stackoverflow.com/questions/54942503/cant-read-write-to-files-using-multithreading-in-python/54943940#54943940) for relevant discussion. – holdenweb Mar 05 '19 at 14:00
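Following up on lukedays's comment about using just a few threads: a rough sketch (not from the thread; the round-robin chunking scheme and thread count are illustrative, Python 2 like the rest of the page) of splitting a long URL list across a fixed number of workers, each fetching its own slice:

import threading
import urllib2

def fetch_many(urls):
    #each worker handles its own slice of the URL list
    for u in urls:
        page = urllib2.urlopen(u)
        print u, len(page.read(1024))

def run_in_chunks(urls, num_threads=4):
    #deal the URLs out round-robin into num_threads slices
    chunks = [urls[i::num_threads] for i in range(num_threads)]
    threads = [threading.Thread(target=fetch_many, args=(chunk,))
               for chunk in chunks if chunk]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

run_in_chunks(["http://yahoo.com", "http://google.com",
               "http://amazon.com", "http://ibm.com", "http://apple.com"])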