I scrape the product links (the <a href=""> anchors) from the seller's products page and store them in the list hrefs:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import os

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
# Build the chromedriver path with os.path.join instead of string concatenation
service = Service(executable_path=os.path.join(os.getcwd(), "chromedriver.exe"))
driver = webdriver.Chrome(service=service, options=chrome_options)
driver.set_page_load_timeout(900)
link = 'https://www.catch.com.au/seller/vdoo/products.html?page=1'
driver.get(link)
soup = BeautifulSoup(driver.page_source, 'lxml')
product_links = soup.find_all("a", class_="css-1k3ukvl")
hrefs = []
for product_link in product_links:
    href = product_link.get("href")
    if href.startswith("/"):
        href = "https://www.catch.com.au" + href
    hrefs.append(href)
There are about 36 products on the page, so roughly 36 links end up in hrefs. Then I pick each link from hrefs, go to it, and scrape further data from each product page:
products = []
for href in hrefs:
    driver.get(href)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    title = soup.find("h1", class_="e12cshkt0").text.strip()
    price = soup.find("span", class_="css-1qfcjyj").text.strip()
    image_link = soup.find("img", class_="css-qvzl9f")["src"]
    product = {
        "title": title,
        "price": price,
        "image_link": image_link
    }
    products.append(product)

driver.quit()
print(len(products))
But it takes too much time. I have already set the page-load timeout to 900 seconds and it still times out. Problems:
- For now I am only fetching product links from the first page, but there are more pages, up to about 40, with 36 products on each. When I extend this to get data from all pages (my intended loop is sketched after this list), it also times out.
- In the second part, when I follow those links and scrape every one, it also takes a long time. How can I reduce the execution time of this program? Can I divide the program into parts (for example, the split sketched below)?
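This is the kind of loop I mean for all pages; a minimal sketch, assuming the ?page= parameter simply runs from 1 to 40 and the anchors use the same css-1k3ukvl class on every page (I have not verified either):

# Collect hrefs from every page by varying the page query parameter.
# Assumes 40 pages and the same anchor class throughout.
base = 'https://www.catch.com.au/seller/vdoo/products.html?page={}'
hrefs = []
for page in range(1, 41):
    driver.get(base.format(page))
    soup = BeautifulSoup(driver.page_source, 'lxml')
    for product_link in soup.find_all("a", class_="css-1k3ukvl"):
        href = product_link.get("href")
        if href.startswith("/"):
            href = "https://www.catch.com.au" + href
        hrefs.append(href)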
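And this is roughly how I imagine dividing the program into parts: one stage that saves the links to a file, and a second stage that reads them back and fetches the product pages in parallel with plain HTTP instead of Selenium. Just a sketch, assuming the product pages render the title/price/image without JavaScript (if they don't, requests will never see that markup and a browser is still needed):

import json
from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup

# Stage 1 output: write the collected links to disk so the two
# stages can run as separate scripts.
def save_links(hrefs, path="hrefs.json"):
    with open(path, "w") as f:
        json.dump(hrefs, f)

# Stage 2: fetch one product page over plain HTTP and parse it.
# Assumes the css-* class names appear in the raw HTML (unverified).
def scrape_product(href):
    html = requests.get(href, timeout=30).text
    soup = BeautifulSoup(html, 'lxml')
    return {
        "title": soup.find("h1", class_="e12cshkt0").text.strip(),
        "price": soup.find("span", class_="css-1qfcjyj").text.strip(),
        "image_link": soup.find("img", class_="css-qvzl9f")["src"],
    }

def scrape_all(path="hrefs.json"):
    with open(path) as f:
        hrefs = json.load(f)
    # A small thread pool overlaps the network waits.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(scrape_product, hrefs))

Would something like this work, or is there a better way to split it?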