
I've written a script in Python using Thread to handle multiple requests at the same time and make the scraping process faster. The script is doing its job accordingly.

In short, what the scraper does: it parses all the links from the landing page leading to each item's main page (where the information is stored) and scrapes the happy hours and featured specials from there. The scraper keeps going until all 29 pages are crawled.

As there may be numerous links to play with, I would like to limit the number of concurrent requests. However, as I don't have much experience with this, I can't find an ideal way to modify my existing script to serve that purpose.

Any help will be vastly appreciated.

This is my attempt so far:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import threading

url = "https://www.totalhappyhour.com/washington-dc-happy-hour/?page={}"

def get_info(link):
    for mlink in [link.format(page) for page in range(1,30)]:
        response = requests.get(mlink)
        soup = BeautifulSoup(response.text,"lxml")
        itemlinks = [urljoin(link,container.select_one("h2.name a").get("href")) for container in soup.select(".profile")]
        threads = []
        for ilink in itemlinks:
            thread = threading.Thread(target=fetch_info,args=(ilink,))
            thread.start()
            threads.append(thread)

        for thread in threads:
            thread.join()

def fetch_info(nlink):
    response = requests.get(nlink)
    soup = BeautifulSoup(response.text,"lxml")
    for container in soup.select(".specials"):
        try:
            hours = container.select_one("h3").text
        except Exception:
            hours = ""
        try:
            fspecial = ' '.join([item.text for item in container.select(".special")])
        except Exception:
            fspecial = ""
        print(f'{hours}---{fspecial}')

if __name__ == '__main__':
    get_info(url)
SIM
  • If you want to set a maximum number of parallel requests, you might need to implement a thread pool. Check [this ticket](https://stackoverflow.com/questions/3033952/threading-pool-similar-to-the-multiprocessing-pool) – Andersson Oct 01 '18 at 15:00
  • This is exactly what I was looking for @sir Andersson. Could you help me implement that when you are free? I highly doubt I can do it myself. Thanks in advance. – SIM Oct 01 '18 at 15:50
  • It seems that the target website has some performance issues or some kind of bot protection: when I try to run the script, the server quickly becomes unresponsive, so it's hard to apply multi-threaded web scraping to this particular site... – Andersson Oct 02 '18 at 09:03
  • Sorry, I can't help much, they blocked me too. Andersson is right, they have a bot-protection mechanism - if you look at their Server header you'll see that they're using Sucuri/Cloudproxy. Of course you could use a proxy until it gets banned, and then a second proxy, and so on, but that's abusive. I'm afraid that there is not much we can do but leave them alone. – t.m.adam Oct 02 '18 at 15:20
  • That site is just a placeholder. However, I've already created one using `multiprocessing`. Thanks a lot, both of you. Should I paste the one I've created below? – SIM Oct 02 '18 at 15:28
  • @asmitu, it's OK to answer your own questions – Andersson Oct 02 '18 at 15:31
  • I've already posted one for your kind consideration, @sir Andersson and t.m.adam. I just need to know whether what I did is right. Inspired by [this blog post](http://blog.adnansiddiqi.me/how-to-speed-up-your-python-web-scraper-by-using-multiprocessing/?utm_source=medium_post_multiprocessing&utm_campaign=c_post_multiprocessing&utm_medium=medium_site). Thanks. – SIM Oct 02 '18 at 15:52

3 Answers


You should look at asyncio; it's simple and can help you do things faster!

multiprocessing.Pool can also simplify your code (in case you don't want to use asyncio), and multiprocessing.pool has a ThreadPool equivalent if you prefer to use threads.

To limit the number of requests, I recommend using threading.Semaphore (or the matching semaphore type if you switch away from threading).

threading approach:

from multiprocessing.pool import ThreadPool as Pool
from threading import Semaphore
from time import sleep


MAX_RUN_AT_ONCE = 5
NUMBER_OF_THREADS = 10

sm = Semaphore(MAX_RUN_AT_ONCE)


def do_task(number):
    with sm:
        print(f"run with {number}")
        sleep(3)
        return number * 2


def main():

    p = Pool(NUMBER_OF_THREADS)
    results = p.map(do_task, range(10))
    print(results)


if __name__ == '__main__':
    main()

multiprocessing approach:

from multiprocessing import Pool
from multiprocessing import Semaphore
from time import sleep


MAX_RUN_AT_ONCE = 5
NUMBER_OF_PROCESS = 10

semaphore = None

def initializer(sm):
    """init the semaphore for the child process"""
    global semaphore
    semaphore = sm


def do_task(number):
    with semaphore:
        print(f"run with {number}\n")
        sleep(3)
        return number * 2


def main():
    sm = Semaphore(MAX_RUN_AT_ONCE)
    p = Pool(NUMBER_OF_PROCESS, initializer=initializer,
             initargs=[sm])

    results = p.map(do_task, range(10))
    print(results)


if __name__ == '__main__':
    main()

asyncio approach:

import asyncio


MAX_RUN_AT_ONCE = 5
sm = asyncio.Semaphore(MAX_RUN_AT_ONCE)

async def do_task(number):
    async with sm:
        print(f"run with {number}\n")
        await asyncio.sleep(3)
        return number * 2

async def main():
    tasks = [asyncio.create_task(do_task(number)) for number in range(10)]
    finished, _ = await asyncio.wait(tasks)
    print([task.result() for task in finished])

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())

For making HTTP requests with asyncio you should use aiohttp (a short sketch follows after the output below). You could also use requests with loop.run_in_executor, but then there is little point in using asyncio at all, since your code would still essentially be blocking requests calls.

output:

run with 0

run with 1

run with 2

run with 3

run with 4

(here there is a pause due to the semaphore and sleep)

run with 5

run with 6

run with 7

run with 8

run with 9

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
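As an illustration, here is a minimal aiohttp sketch that caps concurrency with an asyncio.Semaphore. It assumes aiohttp is installed, and the URL list is only a placeholder; it shows the pattern rather than the original scraper:

import asyncio
import aiohttp

MAX_RUN_AT_ONCE = 5

async def fetch(session, semaphore, link):
    # the semaphore caps how many requests are in flight at any moment
    async with semaphore:
        async with session.get(link) as response:
            return await response.text()

async def main(links):
    semaphore = asyncio.Semaphore(MAX_RUN_AT_ONCE)
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch(session, semaphore, link) for link in links))
        print([len(page) for page in pages])

if __name__ == '__main__':
    # placeholder URLs; swap in the item links you actually want to scrape
    asyncio.run(main(["https://example.com"] * 10))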

You can also check out concurrent.futures.ThreadPoolExecutor.
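For reference, a minimal sketch of the same do_task example with concurrent.futures.ThreadPoolExecutor; here max_workers alone caps how many tasks run at once, so no separate semaphore is needed:

from concurrent.futures import ThreadPoolExecutor
from time import sleep

MAX_RUN_AT_ONCE = 5

def do_task(number):
    print(f"run with {number}")
    sleep(3)
    return number * 2

def main():
    # max_workers limits how many do_task calls run concurrently
    with ThreadPoolExecutor(max_workers=MAX_RUN_AT_ONCE) as executor:
        results = list(executor.map(do_task, range(10)))
    print(results)

if __name__ == '__main__':
    main()

A semaphore would only be needed on top of this if you wanted the request cap to be lower than the number of worker threads.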

SocketPlayer

As I'm very new to creating scrapers that use multiprocessing, I wanted a real-life script in order to understand the logic clearly. The site used in my question has a bot-protection mechanism, so I've found a very similar webpage to apply multiprocessing to instead.

import requests
from multiprocessing import Pool
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "http://srar.com/roster/index.php?agent_search={}"

def get_links(link):
    completelinks = []
    for ilink in [chr(i) for i in range(ord('a'),ord('d')+1)]:
        res = requests.get(link.format(ilink))  
        soup = BeautifulSoup(res.text,'lxml')
        for items in soup.select("table.border tr"):
            if not items.select("td a[href^='index.php?agent']"):continue
            data = [urljoin(link,item.get("href")) for item in items.select("td a[href^='index.php?agent']")]
            completelinks.extend(data)
    return completelinks

def get_info(nlink):
    req = requests.get(nlink)
    sauce = BeautifulSoup(req.text,"lxml")
    for tr in sauce.select("table[style$='1px;'] tr"):
        table = [td.get_text(strip=True) for td in tr.select("td")]
        print(table)

if __name__ == '__main__':
    allurls = get_links(url)
    with Pool(10) as p:  # the pool size is what limits the number of concurrent requests
        p.map(get_info, allurls)  # map() blocks until every task has finished, so no explicit join() is needed
SIM

Although I'm not sure whether I've implemented the ThreadPool logic described in SocketPlayer's answer correctly in the following script, it seems to be working flawlessly. Feel free to correct me if I went wrong anywhere.

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
from multiprocessing.pool import ThreadPool as Pool
from threading import Semaphore

MAX_RUN_AT_ONCE = 5
NUMBER_OF_THREADS = 10

sm = Semaphore(MAX_RUN_AT_ONCE)

url = "http://srar.com/roster/index.php?agent_search={}"

def get_links(link):
    # this runs once in the main thread, so it does not need the semaphore
    completelinks = []
    for ilink in [chr(i) for i in range(ord('a'),ord('d')+1)]:
        res = requests.get(link.format(ilink))
        soup = BeautifulSoup(res.text,'lxml')
        for items in soup.select("table.border tr"):
            if not items.select("td a[href^='index.php?agent']"):continue
            data = [urljoin(link,item.get("href")) for item in items.select("td a[href^='index.php?agent']")]
            completelinks.extend(data)
    return completelinks

def get_info(nlink):
    with sm:  # the semaphore caps how many of the pool's threads make a request at the same time
        req = requests.get(nlink)
        sauce = BeautifulSoup(req.text,"lxml")
        for tr in sauce.select("table[style$='1px;'] tr")[1:]:
            table = [td.get_text(strip=True) for td in tr.select("td")]
            print(table)

if __name__ == '__main__':
    p = Pool(NUMBER_OF_THREADS)
    p.map(get_info, get_links(url))
MITHU