
I was hoping to speed up my scraper by using multiple cores, so that several cores could scrape the URLs in a list I have using a predefined function scrape. How would I do this?

Here is my current code:

for x in URLs['identifier'][1:365]:
    test = scrape(x)
    results = test.get_results
    results['identifier'] = x
    final = final.append(results)
KT6345

1 Answer


Something like this (or you could also use Scrapy):

It will easily allow you to make a lot of requests in parallel, provided the server can handle it as well:

# thread_map is just a wrapper around concurrent.futures' ThreadPoolExecutor with a nice tqdm progress bar!
from tqdm.contrib.concurrent import thread_map, process_map  # for multi-threading and multi-processing respectively

def chunk_list(lst, size):
    # yield successive slices of `size` items from lst
    for i in range(0, len(lst), size):
        yield lst[i:i + size]

for idx, my_chunk in enumerate(chunk_list(huge_list, size=2**12)):
    # which_func_to_call is the function that fetches one URL and returns e.g. the parsed JSON object
    for response in thread_map(which_func_to_call, my_chunk, max_workers=your_cpu_cores + 6):
        # do something with the response now..
        # make sure to cache each chunk's results as well (in case you have a lot of them)
        pass
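
Applied to your loop, a minimal sketch could look like the following. It assumes scrape, get_results, and URLs behave as in your question, that get_results gives back a dict of fields, and scrape_one is just an illustrative wrapper name:

import pandas as pd
from tqdm.contrib.concurrent import thread_map

def scrape_one(x):
    # wraps the scrape()/get_results pattern from the question and returns one row per identifier
    results = scrape(x).get_results
    results['identifier'] = x
    return results

identifiers = list(URLs['identifier'][1:365])
# the work is network-bound, so threads are usually enough; results come back as a list
all_results = thread_map(scrape_one, identifiers, max_workers=16)
final = pd.DataFrame(all_results)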

OR

Use Pool from Python's multiprocessing module:

from multiprocessing import Pool

import requests
from bs4 import BeautifulSoup  # only needed once you start parsing the pages

base_url = 'http://quotes.toscrape.com/page/'

all_urls = list()

def generate_urls():
    # better to yield the URLs instead if you already have a long list of them
    for i in range(1, 11):
        all_urls.append(base_url + str(i))

def scrape(url):
    res = requests.get(url)
    print(res.status_code, res.url)

if __name__ == '__main__':
    generate_urls()

    p = Pool(10)
    p.map(scrape, all_urls)
    p.close()  # no more work to submit
    p.join()   # wait for the workers to finish
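
If you go the Pool route for your case, a rough sketch could look like this. Again it assumes scrape and get_results work as in your question, and note that the function passed to Pool.map must be defined at module level so it can be pickled:

from multiprocessing import Pool
import pandas as pd

def scrape_one(x):
    # illustrative wrapper around the scrape()/get_results pattern from the question
    results = scrape(x).get_results
    results['identifier'] = x
    return results

if __name__ == '__main__':
    identifiers = list(URLs['identifier'][1:365])
    with Pool(processes=8) as p:
        # Pool.map returns the wrapper's return values in order,
        # so the final frame can be built in one go instead of appending row by row
        all_results = p.map(scrape_one, identifiers)
    final = pd.DataFrame(all_results)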
Aditya