
I scrape the product links (`<a href="">`) from the products page and store them in the list `hrefs`:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import os

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
service = Service(executable_path=os.path.join(os.getcwd(), "chromedriver.exe"))
driver = webdriver.Chrome(service=service, options=chrome_options)
driver.set_page_load_timeout(900)
link = 'https://www.catch.com.au/seller/vdoo/products.html?page=1'
driver.get(link)
soup = BeautifulSoup(driver.page_source, 'lxml')
product_links = soup.find_all("a", class_="css-1k3ukvl")

# Collect absolute product URLs.
hrefs = []
for product_link in product_links:
    href = product_link.get("href")
    if href.startswith("/"):
        href = "https://www.catch.com.au" + href
    hrefs.append(href)

There are about 36 links stored in `hrefs`, one for each of the 36 products on the page. Then I pick each link from `hrefs`, open it, and scrape further data from that product page:

# Visit each product page and pull its title, price, and image URL.
products = []
for href in hrefs:
    driver.get(href)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    
    title = soup.find("h1", class_="e12cshkt0").text.strip()
    price = soup.find("span", class_="css-1qfcjyj").text.strip()
    image_link = soup.find("img", class_="css-qvzl9f")["src"]
    product = {
        "title": title,
        "price": price,
        "image_link": image_link
    }
    products.append(product)
driver.quit()
print(len(products))

But it takes too much time. I have already set a 900-second page-load timeout, yet it still times out. Problems:

  1. For now I am only fetching product links from the first page, but there are more pages, up to about 40, with 36 products on each. When I try to get the links from all pages, it also times out.
  2. In the second part, when I visit each of those links and scrape it, that also takes a long time. How can I reduce the execution time of this program? Can I divide the program into parts?
Hashir
  • Take the problems one at a time. Are you saying you get a 900 second timeout on opening the `link` page? – JonSG May 12 '23 at 13:49
  • @JonSG Currently I am only getting the product URLs from the first page. That works but already takes a lot of time. When I go on to scrape each link in the `hrefs` array (36 products), it times out. It also times out when I try to go through all pages (~40 pages). The whole program takes a lot of time; I want to speed up the execution – Hashir May 12 '23 at 14:03
  • Ah, I see. So the issue is more related to timeouts when scraping lots of pages than a timeout on any individual page. Also, you more often see timeouts on product pages rather than on the listing pages. Right? – JonSG May 12 '23 at 14:16
  • @JonSG Yes, exactly. Currently I am getting data from the first page of products only, which is why I get results, and it times out on the listing pages. But whenever I try to collect the links from all pages of products, it times out during that step and never reaches the listing-pages step – Hashir May 12 '23 at 14:35
  • Check out the answers here: https://stackoverflow.com/questions/42732958/python-parallel-execution-with-selenium (a sketch of that idea follows below) – JonSG May 12 '23 at 15:18
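
A minimal sketch of that parallel-Selenium idea, assuming a Selenium version that can locate chromedriver on its own and reusing the `hrefs` list built above; the `scrape_product` helper, the thread-local driver cache, and the worker count are illustrative, not part of the original code:

from concurrent.futures import ThreadPoolExecutor
import threading

from bs4 import BeautifulSoup
from selenium import webdriver

thread_local = threading.local()

def get_driver():
    # One headless driver per worker thread, reused across product pages.
    if not hasattr(thread_local, "driver"):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        thread_local.driver = webdriver.Chrome(options=options)
    return thread_local.driver

def scrape_product(href):
    driver = get_driver()
    driver.get(href)
    soup = BeautifulSoup(driver.page_source, "lxml")
    return {
        "title": soup.find("h1", class_="e12cshkt0").text.strip(),
        "price": soup.find("span", class_="css-1qfcjyj").text.strip(),
        "image_link": soup.find("img", class_="css-qvzl9f")["src"],
    }

# Four workers means four Chrome instances loading product pages in parallel.
with ThreadPoolExecutor(max_workers=4) as executor:
    products = list(executor.map(scrape_product, hrefs))
print(len(products))

Each worker pays the Chrome start-up cost once, and the slow page loads then overlap, which is where most of the waiting time goes.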

1 Answer


You can skip Selenium and obtain the results directly from their Ajax API. For example:

import requests
from bs4 import BeautifulSoup

api_url = "https://www.catch.com.au/seller/vdoo/products.json"

params = {
    "page": 1,  # <-- to get other pages, increase this parameter
}

data = requests.get(api_url, params=params).json()

# Build absolute product URLs from the API response.
urls = []
for r in data['payload']['results']:
    urls.append(f"https://www.catch.com.au{r['product']['productPath']}")

# Fetch each product page and pull the title and price.
for url in urls:
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    price = soup.select_one('[itemprop=price]')['content']
    title = soup.h1.text
    print(f'{title:<100} {price:<5}')

Prints:

2x Pure Natural Cotton King Size Pillow Case Cover Slip - 54x94cm - White                            46.99
Fire Starter Lighter Waterproof Flint Match Metal Keychain Camping Survival - Gold                   20.89
Plain Solid Colour Cushion Cover Covers Decorative Pillow Case - Apple Green                         20.9 
2000TC 4PCS Bed Sheet Set Flat Fitted Pillowcase Single Double Queen King Bed - Black                57.18
All Size Bed Ultra Soft Quilt Duvet Doona Cover Set Bedding - Paris Eiffel Tower                     50.99

...and so on.
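
To cover all of the seller's pages (around 40) and speed up the per-product requests as well, you could page through the same API until it stops returning results and fetch the product pages with a thread pool. A sketch, assuming the endpoint simply returns an empty `results` list past the last page; the worker count and the `scrape_product` helper are illustrative additions, not from the original answer:

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

api_url = "https://www.catch.com.au/seller/vdoo/products.json"

# Page through the API until a page comes back empty.
urls = []
page = 1
while True:
    data = requests.get(api_url, params={"page": page}).json()
    results = data["payload"]["results"]
    if not results:
        break
    for r in results:
        urls.append(f"https://www.catch.com.au{r['product']['productPath']}")
    page += 1

def scrape_product(url):
    # Same parsing as above, wrapped in a function so it can run in a thread.
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    return {
        "title": soup.h1.text,
        "price": soup.select_one("[itemprop=price]")["content"],
    }

# Fetch the individual product pages concurrently.
with ThreadPoolExecutor(max_workers=8) as executor:
    products = list(executor.map(scrape_product, urls))

print(len(products))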
Andrej Kesely
  • Thank you for helping. This is on another level; it will take me some time to understand it :) Somehow it is returning data, and I will adapt it to my needs – Hashir May 12 '23 at 20:16