
I am using a very simple script to scrape information from a public discussion forum. It currently takes around 2 minutes per URL to scrape, and there are 20,000 URLs.

Is there a way to speed up this process?

from bs4 import BeautifulSoup
from selenium import webdriver

urls = ['url1', 'url2', ...]
for url in urls:
    page = webdriver.Chrome()
    page.get(url)
    
    soup = BeautifulSoup(page.page_source, "lxml")
    messages = soup.find_all("div", class_="bbWrapper")
        
    for message in messages:
        print(message.text)
    
    page.quit()
  • I have used Selenium to avoid the following error: "To continue your browser has to accept cookies and has to have JavaScript enabled"
  • I have tried to run Chrome headless, but get blocked by Cloudflare
  • I have read that Selenium Stealth can avoid the Cloudflare block, but I do not know how to install Selenium Stealth in the Anaconda-Python environment

None of "Access Denied page with headless Chrome on Linux while headed Chrome works on windows using Selenium through Python", "How to automate login to a site which is detecting my attempts to login using selenium-stealth", or "Can a website detect when you are using Selenium with chromedriver?" answers the question, as none of them is about improving performance.

Dave
    Yes, stealth Chrome acts like a regular browser; I have used it many times in my projects. I run scripts from the terminal on OS X, also inside a virtual environment, which helps me avoid issues. If your scraping covers different URLs, you could keep multiple tabs open or run multiple Chrome drivers. I have never tried multithreading in Selenium, but I use it a lot in typical scripts with requests, bs4, etc. – P_n Jul 15 '23 at 15:25
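
A hedged sketch of the multithreaded requests + bs4 pattern the comment alludes to. It only applies if the pages can be fetched without JavaScript, which the question suggests is not the case for this forum; the URL list and worker count below are placeholders.

from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup

urls = ['url1', 'url2']  # placeholder list, as in the question

def scrape(url):
    # One GET per URL; a shared requests.Session per thread would be faster still.
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'lxml')
    return [message.text for message in soup.find_all('div', class_='bbWrapper')]

# Ten workers is an arbitrary starting point; tune it to the site's tolerance.
with ThreadPoolExecutor(max_workers=10) as pool:
    for messages in pool.map(scrape, urls):
        for message in messages:
            print(message)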

1 Answer


Here are a few suggestions to enhance your code:

  1. Avoid instantiating Chrome for each URL. Move page = webdriver.Chrome() and page.quit() outside the loop so a single browser instance is reused.
  2. Divide the process into two steps: first retrieve and save the HTML content for each URL, then perform the parsing separately.
  3. Consider multithreading, for example via concurrent.futures or the threading module. Note that a WebDriver instance is not thread-safe, so each worker thread needs its own driver. A combined sketch follows this list.
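
A minimal sketch combining the three suggestions, under a few assumptions: the URL list and max_workers value are placeholders, and one Chrome driver is created per worker thread rather than shared. This illustrates the approach, not a drop-in solution.

import threading
from concurrent.futures import ThreadPoolExecutor

from bs4 import BeautifulSoup
from selenium import webdriver

urls = ['url1', 'url2']  # placeholder list, as in the question

local = threading.local()
drivers = []
drivers_lock = threading.Lock()

def get_driver():
    # Suggestion 1: reuse a browser per thread instead of launching one per URL.
    if not hasattr(local, 'driver'):
        local.driver = webdriver.Chrome()
        with drivers_lock:
            drivers.append(local.driver)
    return local.driver

def fetch(url):
    driver = get_driver()
    driver.get(url)
    return driver.page_source

try:
    # Suggestions 2 and 3: fetch all raw HTML concurrently first...
    with ThreadPoolExecutor(max_workers=4) as pool:
        pages = list(pool.map(fetch, urls))
finally:
    for driver in drivers:
        driver.quit()

# ...then parse separately, with no browser involved.
for html in pages:
    soup = BeautifulSoup(html, 'lxml')
    for message in soup.find_all('div', class_='bbWrapper'):
        print(message.text)

Splitting fetching from parsing also means the saved HTML can be re-parsed later without revisiting any URL.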
Santer
    Thanks, ChatGPT gave me these suggestions already. – Dave Jul 15 '23 at 15:08
    This answer looks like it was generated by an AI (like ChatGPT), not by an actual human being. You should be aware that [posting AI-generated output is officially **BANNED** on Stack Overflow](https://meta.stackoverflow.com/q/421831). If this answer was indeed generated by an AI, then I strongly suggest you delete it before you get yourself into even bigger trouble: **WE TAKE PLAGIARISM SERIOUSLY HERE.** Please read: [Why posting GPT and ChatGPT generated answers is not currently allowed](https://stackoverflow.com/help/gpt-policy). – tchrist Jul 15 '23 at 20:57