
I wrote a program that scrapes information from a given list of websites (100 links). Currently, my program processes them sequentially, i.e. it checks them one at a time. The skeleton of my program is as follows:

for j in range(num_of_links):  # num_of_links is 100 in my case
    try:  # if an error occurs, skip to the next website in the list
        site_exist(j)          # a function to check whether the site exists
        get_url_with_info(j)   # a function to get the links inside the website
    except Exception as e:
        print(str(e))
filter_result_info(links_with_info)  # a function that filters the results

Needless to say, this process is very slow. Is it possible to implement threading so that my program can handle the job faster, e.g. with 4 concurrent workers scraping 25 links each? Can you point me to a reference on how I could do this?

  • Is it possible that your list will contain many links on the same site? If so, you should be slowing down your scraping rather than speeding it up; not doing so may look to a single target like a denial-of-service attempt. – halfer Oct 18 '15 at 08:44
  • If the sites are all different, and you are using cURL, then have a look at the "cURL multi" feature, which lets you start many HTTP operations in parallel. One can do this in PHP, so I'm sure Python allows it too (it's the same underlying library). – halfer Oct 18 '15 at 08:45
  • @halfer You are right. I should put some kind of threshold on the speed of scraping a given site (this I don't know how to do). Yes, the websites in the list are all different, with different contents. The idea is to get certain info from all websites in the list. Currently, my program finishes scraping all websites in about 4 minutes. I just think I can make it faster. Thanks for pointing out cURL multi. I will take a look. – JPdL Oct 18 '15 at 09:19 (a throttling sketch follows these comments)
  • [This link](https://pypi.python.org/pypi/pyparallelcurl/0.0.4) might help show how to use the cURL multi feature. – halfer Oct 18 '15 at 09:31
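
For the per-site throttling mentioned in the comments above, a minimal sketch is to remember when each host was last hit and sleep before requesting it again. The delay value and the polite_fetch helper below are illustrative assumptions, not part of the original program:

import time
from urllib.parse import urlparse

MIN_DELAY = 2.0   # assumed minimum number of seconds between hits to the same host
last_hit = {}     # host -> timestamp of the most recent request

def polite_fetch(url, fetch):
    """Call fetch(url), sleeping first if the same host was contacted too recently."""
    host = urlparse(url).netloc
    elapsed = time.time() - last_hit.get(host, 0.0)
    if elapsed < MIN_DELAY:
        time.sleep(MIN_DELAY - elapsed)
    last_hit[host] = time.time()
    return fetch(url)

If this is later combined with the threaded approaches below, last_hit would need to be protected by a lock.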

2 Answers


What you want is a Pool of threads.

from concurrent.futures import ThreadPoolExecutor


def get_url(url):
    try:
        if site_exists(url):
            return get_url_with_info(url)
        else:
            return None
    except Exception as error: 
        print(error)


with ThreadPoolExecutor(max_workers=4) as pool:
    # map returns an iterator of results; turning it into a list
    # waits until all URLs have been retrieved
    list_of_results = list(pool.map(get_url, list_of_urls))

filter_result_info(list_of_results)  # note that some results may be None
– noxdafox
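
A variant of the same idea, in case results are wanted as soon as each URL finishes rather than in input order, is to submit the jobs individually and collect them with as_completed. This reuses get_url and list_of_urls from the answer above; the rest is a sketch, not the answer's own code:

from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor(max_workers=4) as pool:
    # one future per URL; as_completed yields them in completion order
    futures = [pool.submit(get_url, url) for url in list_of_urls]
    list_of_results = [f.result() for f in as_completed(futures)]

filter_result_info(list_of_results)  # some results may be None

Either way, the 4 workers pull URLs from the shared queue as they become free, so there is no need to split the 100 links into four chunks of 25 by hand.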

Threading would not speed it up. Multiprocessing is probably what you want.

Multiprocessing vs Threading Python

– Chromadude
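
If the multiprocessing route suggested in this answer is taken, a minimal sketch would be the following. multiprocessing.Pool is the standard-library API; get_url, list_of_urls and filter_result_info are reused from the first answer and the question, not defined here:

from multiprocessing import Pool

if __name__ == "__main__":  # required on platforms that spawn worker processes
    # get_url must be defined at module level so it can be pickled
    with Pool(processes=4) as pool:
        list_of_results = pool.map(get_url, list_of_urls)
    filter_result_info(list_of_results)

As the comments below point out, this pays process start-up and pickling costs that threads avoid, so for an I/O-bound job like this it is unlikely to beat the thread pool above.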
  • That is not true; threading speeds web requests up by making the calls non-blocking (async). You just need to know how to do it. – lingxiao Oct 18 '15 at 08:42
  • Only threading and asyncio will speed up this process; multiprocessing can actually slow it down to some extent in this specific case, since this is an I/O-bound problem, not a CPU-bound one. Multiprocessing speeds things up only when we have a CPU-bound problem. – Manish Jan 25 '22 at 09:00 (an asyncio sketch follows below)
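
Following the asyncio suggestion in the last comment, a minimal sketch might look like this. It assumes the third-party aiohttp package is installed, and fetch_one/fetch_all are illustrative names rather than anything from the original program:

import asyncio
import aiohttp  # third-party package, assumed to be installed

async def fetch_one(session, url):
    try:
        async with session.get(url) as response:
            return await response.text()  # placeholder for the real scraping logic
    except Exception as error:
        print(error)
        return None

async def fetch_all(urls):
    # limit the number of simultaneous connections to 4,
    # mirroring the 4 workers asked about in the question
    connector = aiohttp.TCPConnector(limit=4)
    async with aiohttp.ClientSession(connector=connector) as session:
        return await asyncio.gather(*(fetch_one(session, url) for url in urls))

list_of_results = asyncio.run(fetch_all(list_of_urls))
filter_result_info(list_of_results)  # some results may be None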