
My scraper is running slowly (one page at a time), so I'm trying to use threads to make it faster. I have a function scrape(website) that takes in a website to scrape, so I can easily create a thread per website and call start() on each of them.

Now I want to implement a num_threads variable that sets the number of threads I want to run at the same time. What is the best way to manage those threads?

For example: suppose num_threads = 5. My goal is to start 5 threads and have them grab the first 5 websites in the list and scrape them; then, if thread #3 finishes, it should grab the 6th website from the list and scrape it immediately, rather than waiting until the other threads end.

Any recommendations for how to handle this? Thank you

Kiddo
  • Perhaps this can help? http://stackoverflow.com/questions/2846653/python-multithreading-for-dummies But I'm not sure you really need threading in your case? Just some kind of queue? – trainoasis Feb 03 '15 at 20:43
  • Have you considered using [Scrapy](http://scrapy.org/)? – Carl Groner Feb 03 '15 at 20:45
  • @trainoasis yes, it's like a queue; 5 threads will pull items from that queue and process them – Kiddo Feb 03 '15 at 21:01

2 Answers


It depends.

If your code is spending most of its time waiting for network operations (likely, in a web scraping application), threading is appropriate. The easiest way to implement a thread pool is to use concurrent.futures, available in the standard library since Python 3.2. Failing that, you can create a queue.Queue object and write each thread as an infinite loop that consumes work items from the queue and processes them.
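
A minimal sketch of that queue-based approach, assuming the scrape(website) function and num_threads from the question; the websites list and the scrape body here are placeholders:

import threading
import queue  # called "Queue" in Python 2

def scrape(website):
    print('scraping', website)  # placeholder for the question's scrape() function

websites = ['http://example.com/a', 'http://example.com/b']  # placeholder list
num_threads = 5

work_queue = queue.Queue()
for site in websites:
    work_queue.put(site)

def worker():
    # Infinite loop: keep pulling the next website off the queue and scraping it
    while True:
        site = work_queue.get()   # blocks until an item is available
        try:
            scrape(site)
        finally:
            work_queue.task_done()

for _ in range(num_threads):
    t = threading.Thread(target=worker)
    t.daemon = True               # daemon threads won't keep the program alive
    t.start()

work_queue.join()                 # wait until every queued website has been handled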

If your code is spending most of its time processing data after you've downloaded it, threading is useless due to the GIL. concurrent.futures also provides support for process concurrency (ProcessPoolExecutor), again in Python 3.2+. For older Pythons, use multiprocessing; it provides a Pool type which simplifies the work of creating a process pool.
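
A minimal sketch with multiprocessing.Pool, again assuming a scrape(website) function and a websites list like the question's (both bodies below are placeholders; the __main__ guard is required on Windows):

import multiprocessing

def scrape(website):
    return len(website)  # placeholder for the real CPU-heavy work

if __name__ == '__main__':
    websites = ['http://example.com/a', 'http://example.com/b']  # placeholder list
    pool = multiprocessing.Pool(processes=5)     # 5 worker processes
    results = pool.map(scrape, websites)         # farm the list out across the pool
    pool.close()
    pool.join()
    print(results)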

You should profile your code (using cProfile) to determine which of those two scenarios you are experiencing.
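
For example, you can run the script under cProfile from the command line and sort by cumulative time to see whether network waits or data processing dominate (scraper.py stands in for your script's name):

python -m cProfile -s cumulative scraper.py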

Kevin
  • thanks, I spend most of the program's time scraping; the data will be saved to a text file and used by another script later – Kiddo Feb 03 '15 at 21:00

If you're using Python 3, have a look at concurrent.futures.ThreadPoolExecutor

Example pulled from the docs (ThreadPoolExecutor example):

import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the url and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))

If you're using Python 2, there is a backport available (the futures package on PyPI):

ThreadPoolExecutor Example:

from concurrent import futures
import urllib2

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

def load_url(url, timeout):
    return urllib2.urlopen(url, timeout=timeout).read()

with futures.ThreadPoolExecutor(max_workers=5) as executor:
    future_to_url = dict((executor.submit(load_url, url, 60), url)
                         for url in URLS)

    for future in futures.as_completed(future_to_url):
        url = future_to_url[future]
        if future.exception() is not None:
            print('%r generated an exception: %s' % (url,
                                                     future.exception()))
        else:
            print('%r page is %d bytes' % (url, len(future.result())))
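
Applied to the question, the whole scraper could be driven with something like the sketch below (assuming the scrape(website) function, a websites list, and num_threads from the question; the list and the scrape body here are placeholders, using the Python 3 spelling of the import):

import concurrent.futures

def scrape(website):
    print('scraping', website)  # placeholder for the question's scrape() function

websites = ['http://example.com/a', 'http://example.com/b']  # placeholder list
num_threads = 5

with concurrent.futures.ThreadPoolExecutor(max_workers=num_threads) as executor:
    # At most num_threads scrapes run at once; as soon as one finishes,
    # its thread immediately picks up the next pending website.
    list(executor.map(scrape, websites))
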
cziemba