I've written a script in Python using threading.Thread to handle multiple requests at the same time and make the scraping process faster. The script is doing its job accordingly.
In short, the scraper parses all the links from the landing page that lead to each listing's main page (where the information is stored) and scrapes the happy hours and featured specials from there. It keeps going until all 29 pages are crawled.
As there may be numerous links to deal with, I would like to limit the number of concurrent requests. However, I don't have much of an idea about this, so I can't find an ideal way to modify my existing script to serve the purpose. Any help will be greatly appreciated.
This is my attempt so far:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import threading

url = "https://www.totalhappyhour.com/washington-dc-happy-hour/?page={}"

def get_info(link):
    # Walk through all 29 listing pages
    for mlink in [link.format(page) for page in range(1, 30)]:
        response = requests.get(mlink)
        soup = BeautifulSoup(response.text, "lxml")
        # Collect the links leading to each profile's detail page
        itemlinks = [urljoin(link, container.select_one("h2.name a").get("href")) for container in soup.select(".profile")]
        # Spawn one thread per detail page, then wait for them all to finish
        threads = []
        for ilink in itemlinks:
            thread = threading.Thread(target=fetch_info, args=(ilink,))
            thread.start()
            threads.append(thread)
        for thread in threads:
            thread.join()

def fetch_info(nlink):
    response = requests.get(nlink)
    soup = BeautifulSoup(response.text, "lxml")
    for container in soup.select(".specials"):
        try:
            hours = container.select_one("h3").text
        except Exception:
            hours = ""
        try:
            fspecial = ' '.join([item.text for item in container.select(".special")])
        except Exception:
            fspecial = ""
        print(f'{hours}---{fspecial}')

if __name__ == '__main__':
    get_info(url)
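
To clarify what I mean by limiting the number of requests, the sketch below is roughly what I have in mind: wrapping the request inside fetch_info in a threading.Semaphore so that only a handful of threads hit the site at the same time. The limit of 5 is just an arbitrary value I picked, and I'm not sure this is the ideal approach:

import threading

import requests
from bs4 import BeautifulSoup

max_threads = 5                              # arbitrary cap on concurrent requests
semaphore = threading.Semaphore(max_threads)

def fetch_info(nlink):
    # Only max_threads requests can be in flight at once;
    # the other threads block here until a slot frees up.
    with semaphore:
        response = requests.get(nlink)
    soup = BeautifulSoup(response.text, "lxml")
    for container in soup.select(".specials"):
        try:
            hours = container.select_one("h3").text
        except Exception:
            hours = ""
        try:
            fspecial = ' '.join([item.text for item in container.select(".special")])
        except Exception:
            fspecial = ""
        print(f'{hours}---{fspecial}')

I've also seen concurrent.futures.ThreadPoolExecutor(max_workers=...) suggested for this kind of thing, but I'm not sure which approach fits my existing script better.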