I am using a very simple script to scrape information from a public discussion forum. It currently takes around 2 minutes per URL, and there are 20,000 URLs, so at that rate a full run would take roughly 28 days.
Is there a way to speed up this process?
from bs4 import BeautifulSoup
from selenium import webdriver

urls = ['url1', 'url2', ...]

for url in urls:
    # a fresh Chrome instance is started for every URL
    page = webdriver.Chrome()
    page.get(url)
    # parse the rendered page and print every message body
    soup = BeautifulSoup(page.page_source, "lxml")
    messages = soup.find_all("div", class_="bbWrapper")
    for message in messages:
        print(message.text)
    page.quit()
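
The only improvement I have come up with myself is starting Chrome once and reusing it for every URL, instead of paying the browser start-up cost 20,000 times. A rough sketch of what I mean (untested, and I do not know how much of the 2 minutes per URL it actually saves):

from bs4 import BeautifulSoup
from selenium import webdriver

urls = ['url1', 'url2', ...]

# start one browser up front and reuse it for the whole list
page = webdriver.Chrome()
try:
    for url in urls:
        page.get(url)
        soup = BeautifulSoup(page.page_source, "lxml")
        for message in soup.find_all("div", class_="bbWrapper"):
            print(message.text)
finally:
    # make sure the browser is closed even if a page fails to load
    page.quit()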
- I have used Selenium to avoid the following error:
  "To continue your browser has to accept cookies and has to have JavaScript enabled"
- I have tried running Chrome headless, but then I get blocked by Cloudflare.
- I have read that Selenium Stealth can avoid the Cloudflare block, but I do not know how to install Selenium Stealth in the Anaconda-Python environment (my best guess is sketched just after this list).
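
As far as I can tell from the selenium-stealth README, it is an ordinary pip package, so my guess is that installing it inside the activated conda environment and applying it would look roughly like this (the stealth() arguments are copied from the package's documented example, and I have not been able to verify this setup myself):

# run inside the activated Anaconda environment first:
#     pip install selenium-stealth

from selenium import webdriver
from selenium_stealth import stealth

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # newer Chrome headless mode

driver = webdriver.Chrome(options=options)

# apply the stealth patches before loading any page
stealth(
    driver,
    languages=["en-US", "en"],
    vendor="Google Inc.",
    platform="Win32",
    webgl_vendor="Intel Inc.",
    renderer="Intel Iris OpenGL Engine",
    fix_hairline=True,
)

driver.get("url1")  # placeholder URL
print(driver.page_source[:200])
driver.quit()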
None of Access Denied page with headless Chrome on Linux while headed Chrome works on windows using Selenium through Python, How to automate login to a site which is detecting my attempts to login using selenium-stealth, or Can a website detect when you are using Selenium with chromedriver? answers this question, since none of them is about improving performance.
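
I assume the real speedup has to come from running several browsers in parallel, so that the 20,000 URLs are split across workers. Below is a rough idea of what I mean, where NUM_WORKERS is a placeholder value I made up and the right number presumably depends on CPU and RAM; I have not tried this and do not know whether it plays well with Cloudflare:

from concurrent.futures import ThreadPoolExecutor
from bs4 import BeautifulSoup
from selenium import webdriver

urls = ['url1', 'url2', ...]
NUM_WORKERS = 4  # placeholder; tune to the machine

def scrape_chunk(chunk):
    # one browser per worker, reused for that worker's whole share of URLs
    driver = webdriver.Chrome()
    results = []
    try:
        for url in chunk:
            driver.get(url)
            soup = BeautifulSoup(driver.page_source, "lxml")
            results.extend(m.text for m in soup.find_all("div", class_="bbWrapper"))
    finally:
        driver.quit()
    return results

# deal the URLs out round-robin so each worker gets an even share
chunks = [urls[i::NUM_WORKERS] for i in range(NUM_WORKERS)]
with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
    for messages in pool.map(scrape_chunk, chunks):
        for message in messages:
            print(message)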