
I created a web scraper that uses Selenium and BeautifulSoup to scrape the name and price of products on Coolblue.nl (a Dutch online electronics marketplace) using a product-specific URL. Even though I made the browser headless, it still takes quite some time to get the product name and price. How can I speed up the scraping process? Is Selenium the right way to go about this?

from bs4 import BeautifulSoup
from selenium import webdriver
import json



def get_price(url):
    # Tweak the options of the ChromeDriver used by Selenium
    options = webdriver.ChromeOptions()
    # Run 'headless' (no visible browser window) and disable the GPU
    options.add_argument('--headless')
    options.add_argument('--disable-gpu')
    # Headless requests are often blocked by websites, so set a
    # 'user-agent' that reads like a request sent by a regular browser.
    options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36')
    # Launch ChromeDriver with the options added above
    driver = webdriver.Chrome(options=options)
    try:
        # Send a GET request to the URL and grab the rendered page source
        driver.get(url)
        data_source = driver.page_source
    finally:
        # Always shut the browser down, even if the request fails
        driver.quit()
    # BeautifulSoup parses the page source data with the HTML parser
    parser = BeautifulSoup(data_source, 'html.parser')
    # The product name and current price live in the page's JSON-LD metadata
    json_str = parser.find('script', {'type': 'application/ld+json'}).get_text()
    json_str = json_str.replace('\n', '').replace('\t', '')
    data = json.loads(json_str)
    name = data['name']
    price = data['offers']['price']
    return str(name), float(price)

I have tried using requests, but the request to Coolblue.nl does not work.

stilts15
  • I got it to work with requests by setting the user-agent to 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36' – stilts15 Jul 24 '23 at 22:00

1 Answer


Precisely, whether Selenium is the right way to scrape a given web application depends largely on the underlying application architecture.

If you have to scrape pages with static elements, BeautifulSoup alone would provide the much needed performance. But if the elements on the page are generated dynamically, e.g. through JavaScript or Ajax calls, then you need to allow the dynamic components to get rendered within the HTML DOM first. In those cases, there can't be any better approach than using Selenium.
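A minimal sketch of that Selenium approach, waiting explicitly for the JSON-LD block the question parses before handing the page source to BeautifulSoup (the selector and timeout here are assumptions, not taken from Coolblue's actual markup):

```python
import json

from bs4 import BeautifulSoup


def extract_product(page_source):
    """Pull (name, price) out of the page's JSON-LD metadata block."""
    soup = BeautifulSoup(page_source, 'html.parser')
    tag = soup.find('script', {'type': 'application/ld+json'})
    data = json.loads(tag.get_text())
    return str(data['name']), float(data['offers']['price'])


def get_price(url, timeout=10):
    # Selenium is only needed for the fetch, so import it here;
    # extract_product() stays usable on any HTML string.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until the JSON-LD script tag is present in the DOM,
        # i.e. the dynamically rendered part we need has arrived.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, 'script[type="application/ld+json"]')))
        return extract_product(driver.page_source)
    finally:
        driver.quit()
```

The explicit wait returns as soon as the element exists, which is usually faster and more reliable than a fixed `time.sleep()`.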

undetected Selenium
  • Thanks. The page has static elements, but when I use the requests module it does not return anything - I get a 403 error. Is there anyway to avoid this, or is Selenium the only option in that case? – stilts15 Jul 24 '23 at 20:35
  • @stilts15 Use Selenium to initiate the session and then use Beautifulsoup to scrape. – undetected Selenium Jul 24 '23 at 20:49
  • Thanks again for your help and quick response. I actually got it to work with requests. I set the user-agent to 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36' and now it works. – stilts15 Jul 24 '23 at 21:59
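For reference, putting the thread's resolution together: a requests-based version of the scraper using the user-agent workaround from the comments. The JSON-LD layout is assumed to match what the Selenium version parsed; since no browser is launched, this avoids the page-rendering overhead entirely.

```python
import json

import requests
from bs4 import BeautifulSoup

HEADERS = {
    # The user-agent string that reportedly gets past the 403,
    # taken from the comments above.
    'user-agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/78.0.3904.108 Safari/537.36')
}


def parse_product(html):
    """Extract (name, price) from the page's JSON-LD metadata."""
    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.find('script', {'type': 'application/ld+json'})
    data = json.loads(tag.get_text())
    return str(data['name']), float(data['offers']['price'])


def get_price(url):
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()  # surface a 403 instead of parsing an error page
    return parse_product(response.text)
```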