I suggest building a multithreaded program to make the requests. concurrent.futures
is one of the easiest ways to multithread these kinds of requests, in particular using ThreadPoolExecutor. The documentation even includes a simple multithreaded URL request example.
Here is some sample code using bs4 and concurrent.futures:
import time
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, as_completed

URLs = [ ... ]  # A long list of URLs.

def parse(url):
    # Fetch the page and return all <a> tags found in it.
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    return soup.find_all('a')

# Run 10 workers concurrently. The work is I/O-bound, so you are not limited to
# the number of CPU cores; tune max_workers to what the target server tolerates.
with ThreadPoolExecutor(max_workers=10) as executor:
    start = time.time()
    futures = [executor.submit(parse, url) for url in URLs]
    results = []
    for future in as_completed(futures):
        # .result() returns the list of <a> tags (or re-raises any exception from parse).
        results.append(future.result())
    end = time.time()

print("Time Taken: {:.6f}s".format(end - start))
Also, you may want to check out the Python Scrapy framework. It scrapes data concurrently and is very easy to learn; it also comes with many features such as auto-throttling, rotating proxies and user-agents, and you can easily integrate it with your databases as well.
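As a rough illustration, a minimal Scrapy spider for the same job might look like this (the spider name, the filename, and the output file below are just placeholders for your own setup):

import scrapy

class LinkSpider(scrapy.Spider):
    name = "links"                      # placeholder spider name
    start_urls = [ ... ]                # your long list of URLs
    custom_settings = {
        "AUTOTHROTTLE_ENABLED": True,   # let Scrapy adapt the request rate
    }

    def parse(self, response):
        # Yield every link found on the page as an item.
        for href in response.css("a::attr(href)").getall():
            yield {"url": response.url, "link": href}

Run it with e.g. scrapy runspider links_spider.py -o links.json and Scrapy schedules the requests concurrently for you.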