33

I am new to both Python and to threads. I have written Python code which acts as a web crawler and searches sites for a specific keyword. My question is: how can I use threads to run three different instances of my class at the same time? When one of the instances finds the keyword, all three must close and stop crawling the web. Here is some code:

class Crawler:
    def __init__(self):
        # the actual code for finding the keyword
        pass

def main():
    crawler = Crawler()

if __name__ == "__main__":
    main()

How can I use threads to have Crawler do three different crawls at the same time?

user446836

5 Answers

56

There doesn't seem to be a (simple) way to terminate a thread in Python.

Here is a simple example of running multiple HTTP requests in parallel:

import threading

def crawl():
    import urllib2  # Python 2 library; on Python 3, use urllib.request instead
    data = urllib2.urlopen("http://www.google.com/").read()  # blocking HTTP request

    print "Read google.com"

threads = []

for n in range(10):
    thread = threading.Thread(target=crawl)
    thread.start()

    threads.append(thread)

# wait until all of the threads have finished

print "Waiting..."

for thread in threads:
    thread.join()

print "Complete."

With additional overhead, you can use a multi-process approach that's more powerful and allows you to terminate thread-like processes.

I've extended the example to use that. I hope this will be helpful to you:

import multiprocessing

def crawl(result_queue):
    import urllib2  # Python 2 library; on Python 3, use urllib.request instead
    data = urllib2.urlopen("http://news.ycombinator.com/").read()

    print "Requested..."

    # Placeholder: a non-empty string is always true.
    # Replace this with your real keyword check against `data`.
    if "result found (for example)":
        result_queue.put("result!")

    print "Read site."

processes = []
result_queue = multiprocessing.Queue()

for n in range(4): # start 4 processes crawling for the result
    process = multiprocessing.Process(target=crawl, args=[result_queue])
    process.start()
    processes.append(process)

print "Waiting for result..."

result = result_queue.get() # blocks until any of the processes has `.put()` a result

for process in processes: # then kill them all off
    process.terminate()

print "Got result:", result
Jeremy
  • Thank you for your answer. What exactly does the join statement do? And how would a multi-process approach be implemented? – user446836 Jun 08 '11 at 23:15
  • The join basically says: wait here until the thread (its run method) stops processing. – Nix Jun 08 '11 at 23:18
  • `.join()` waits until the thread has finished executing -- so it can't be used to stop the crawlers, but only to synchronize your code after the crawling is finished. I have added a multi-process example to my post (I didn't remember the API off the top of my head :P). – Jeremy Jun 08 '11 at 23:19
  • Your updated answer that includes multiprocessing seems to be working great, except that the processes aren't being terminated. The program hangs at result = result_queue.get(). Any idea what I am doing wrong? – user446836 Jun 08 '11 at 23:29
  • @Nix - I'm just learning how threading in Python works, but I think looping over the threads and calling `.join()` on them may be more complicated than you're making it out to be. `.join()` blocks the calling thread until the called thread completes. Thus each time `.join()` is called, the loop stops running until that particular thread completes. The threads aren't all joined at once; rather, they run concurrently and block that loop one at a time in the order they're added to `threads` (not based on when they end, although they won't unblock until they end). Am I correct? – ArtOfWarfare Oct 30 '13 at 17:31
6

Starting a thread is easy:

thread = threading.Thread(target=function_to_call_inside_thread)
thread.start()

Create an event object to notify when you are done:

event = threading.Event()
event.wait() # call this in the main thread to wait for the event
event.set() # call this in a thread when you are ready to stop

You'll need to add a stop() method to your crawler class so that, once the event has fired, the main thread can tell each crawler to finish up:

for crawler in crawlers:
    crawler.stop()

And then call join on the threads:

thread.join() # waits for the thread to finish
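
Put together, a minimal sketch of that pattern might look something like this (the crawl loop and the keyword check are stand-ins for your real crawler; the timeout on `wait()` is only there so the demo eventually finishes even if nothing is found):

import threading
import time

class Crawler:
    def __init__(self, found_event):
        self.found_event = found_event
        self.running = True

    def stop(self):
        self.running = False

    def run(self):
        while self.running:
            time.sleep(0.1)      # stand-in for fetching and scanning a page
            found = False        # replace with the real keyword check
            if found:
                self.found_event.set()   # wake up the main thread
                return

found_event = threading.Event()
crawlers = [Crawler(found_event) for _ in range(3)]
threads = [threading.Thread(target=c.run) for c in crawlers]

for thread in threads:
    thread.start()

found_event.wait(timeout=2)      # block until a crawler reports a hit (or the demo timeout)
for crawler in crawlers:
    crawler.stop()               # ask every crawler to stop at its next loop iteration
for thread in threads:
    thread.join()                # wait for them all to exit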

If you do any amount of this kind of programming, you'll want to look at the eventlet module. It allows you to write "threaded" code without many of the disadvantages of threading.

Winston Ewert
5

First off, if you're new to Python, I wouldn't recommend taking on threads yet. Get used to the language first, then tackle multithreading.

With that said, if your goal is to parallelize (you said "run at the same time"), you should know that in Python (or at least in the default implementation, CPython) multiple threads WILL NOT truly run in parallel, even if multiple processor cores are available. Read up on the GIL (Global Interpreter Lock) for more information.

Finally, if you still want to go on, check the Python documentation for the threading module. I'd say Python's docs are as good as references get, with plenty of examples and explanations.

salezica
  • "multiple threads WILL NOT truly run in parallel, even if multiple processor cores are available." That's oversimplified and unhelpful in this case. Many blocking operations, like HTTP requests, release the GIL and *will* run in parallel. Simple threads are probably sufficient here. – Jeremy Jun 08 '11 at 23:05
0

First of all, threading is not a solution in Python. Due to the GIL, threads do not run in parallel. You can handle this with multiprocessing, though you'll then be limited by the number of processor cores.

What's the goal of your work? Do you want to build a crawler, or do you have academic goals (learning about threading and Python, etc.)?

Another point: crawlers use more resources than most programs, so what is the scale of your crawl?

Wael Ben Zid El Guebsi
0

For this problem, you can use either the threading module (which, as others have said, will not give you true parallelism because of the GIL) or the multiprocessing module (depending on which version of Python you're using). They have very similar APIs, but I recommend multiprocessing, as it is more Pythonic, and I find communicating between processes with Pipes pretty easy.

You'll want a main loop that creates your processes, and each of these processes should run your crawler and have a pipe back to the main process. Each process should listen for a message on the pipe, do some crawling, and send a message back over the pipe if it finds something (before terminating). Your main loop should loop over each of the pipes, listening for this "found something" message. Once it hears that message, it should resend it over the pipes to the remaining processes, then wait for them to complete.
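
Here is a rough sketch of that scheme (the page loop, the `found_keyword` flag, and the sleeps are simulated stand-ins for your real fetching and keyword-checking logic):

import multiprocessing
import time

def crawl(conn, crawler_id):
    # Simulated crawl loop; real code would fetch pages and scan them for the keyword.
    for page in range(100):
        if conn.poll():                  # the main process says another crawler found it
            conn.recv()
            return
        time.sleep(0.1)                  # stand-in for downloading and parsing a page
        found_keyword = (crawler_id == 1 and page == 3)  # pretend crawler 1 finds it
        if found_keyword:
            conn.send("found something")  # report back to the main process
            return

if __name__ == "__main__":
    pipes = []
    processes = []
    for i in range(3):
        parent_conn, child_conn = multiprocessing.Pipe()
        process = multiprocessing.Process(target=crawl, args=(child_conn, i))
        process.start()
        pipes.append(parent_conn)
        processes.append(process)

    # Main loop: poll each pipe until one crawler reports a hit.
    finder = None
    while finder is None:
        for index, conn in enumerate(pipes):
            if conn.poll():
                print(conn.recv())       # "found something"
                finder = index
                break
        time.sleep(0.05)

    # Pass the news on to the other crawlers, then wait for everyone to finish.
    for index, conn in enumerate(pipes):
        if index != finder:
            conn.send("stop")
    for process in processes:
        process.join()

The `if __name__ == "__main__":` guard matters here because multiprocessing spawns rather than forks on some platforms (e.g. Windows).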

More information can be found here: http://docs.python.org/library/multiprocessing.html

Robotica
  • The multiprocessing module only really makes sense to use if you are actually CPU bound. – Winston Ewert Jun 08 '11 at 23:17
  • Is there a reason *not* to use it if you aren't? Note: I'm not saying you have to use it. You can implement approximately the same solution using the threading module. – Robotica Jun 08 '11 at 23:18
  • Agreed, no real reason to use multiprocessing here, just additional headaches. – Nix Jun 08 '11 at 23:19
  • Extra overhead in starting additional processes, increased complexity in communication between host and client programs, decreased compatibility with other Python implementations, slightly different behaviour on Linux/Windows. – Winston Ewert Jun 08 '11 at 23:20
  • Honest question: how is it more of a headache than using threading? Having used both, I found them to be pretty equal in pain. – Robotica Jun 08 '11 at 23:21