I created a web scraper that uses Selenium and BeautifulSoup to scrape the name and price of a product on Coolblue.nl (a Dutch online electronics marketplace) from a product-specific URL. Even though I run the browser headless, it still takes quite some time to get the product name and price. How can I speed up the scraping process? Is Selenium the right tool for this?
from bs4 import BeautifulSoup
from selenium import webdriver
import json
def get_price(url):
    # Configure the ChromeDriver used by Selenium
    options = webdriver.ChromeOptions()
    # Run Chrome headless and disable the GPU
    options.add_argument('headless')
    options.add_argument('disable-gpu')
    # Headless requests are often blocked by websites, so the user agent is set to look like a regular browser
    options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0;Win64) AppleWebkit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36')
    # Launch ChromeDriver with these options
    driver = webdriver.Chrome(options=options)
    try:
        # The driver sends a GET request to the URL
        driver.get(url)
        # The driver returns the page source
        data_source = driver.page_source
    finally:
        # Close the browser so Chrome processes do not pile up across calls
        driver.quit()
    # BeautifulSoup parses the page source with the built-in HTML parser
    parser = BeautifulSoup(data_source, 'html.parser')
    # The product name and current price are read from the page's JSON-LD script block
    json_str = parser.find('script', {'type': 'application/ld+json'}).get_text()
    json_str = json_str.replace('\n', '').replace('\t', '')
    data = json.loads(json_str)
    name = data['name']
    price = data['offers']['price']
    return str(name), float(price)
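The parsing step above does not depend on Selenium, so it can be checked on its own. Below is a minimal sketch of that step, assuming the page embeds a JSON-LD block with `name` and `offers.price` keys as the code above expects. It uses the standard library's `html.parser` instead of BeautifulSoup so it runs without extra packages. `JsonLdExtractor`, `parse_product`, and `SAMPLE_HTML` are hypothetical names, and the sample markup is made up rather than taken from Coolblue.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        # Start buffering when a JSON-LD script element opens
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Store the buffered text when the script element closes
        if tag == "script" and self._in_json_ld:
            self.blocks.append("".join(self._buffer))
            self._in_json_ld = False


def parse_product(html):
    """Return (name, price) from the first JSON-LD block in an HTML string."""
    extractor = JsonLdExtractor()
    extractor.feed(html)
    data = json.loads(extractor.blocks[0])
    return str(data["name"]), float(data["offers"]["price"])


# Made-up sample page (assumption: not real Coolblue markup)
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"name": "Example Laptop", "offers": {"price": "499.00"}}
</script>
</head><body></body></html>
"""

name, price = parse_product(SAMPLE_HTML)
```

Because this function only takes an HTML string, it works the same whether the page source comes from Selenium or from any other HTTP client.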
I have also tried using the requests library instead of Selenium, but my request to Coolblue.nl does not work.
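To narrow down which step is slow, each stage could be timed separately. A minimal sketch follows; `timed` is a hypothetical helper written for illustration, not part of Selenium or BeautifulSoup.

```python
import time


def timed(label, func, *args, **kwargs):
    """Call func(*args, **kwargs), print how long it took, and return its result."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f} s")
    return result
```

For example, `driver = timed('launch', webdriver.Chrome, options=options)` followed by `timed('get', driver.get, url)` would show whether starting the browser or loading the page takes most of the time.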