
I am trying to scrape "The North Face" website for a group project and I am looking for a faster way to get the output. Is there a way to avoid opening a Chrome window every time I get the HTML of a page? I can't use requests because it doesn't give me the FULL source code. Thanks for the help. This is what I have:

import requests
from bs4 import BeautifulSoup
from helium import *
import time

# Headers that identify the request as coming from a regular Chrome browser.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"}
# Open Chrome in the background (headless).
browser = start_chrome("https://www.thenorthface.com/shop/mens-jackets-vests-en-ca#facet=&beginIndex=0", headless=True)
# Click the "LOAD MORE" button until all products on the page are loaded.
while Text("LOAD MORE").exists():
    click("LOAD MORE")
    time.sleep(2.0)

# get the html source of the page
html = browser.page_source
kill_browser()
# Create a soup object.
soup = BeautifulSoup(html, "html.parser")
# print(soup.prettify())
# soup object for all products
products_cards = soup.find_all("div", {"class": "product-block-info info info-js"})
# print(products_cards)

products_names = []
products_links = []
products_prices = []
for card in products_cards:
    for name in card.find_all("div", {"class": "product-block-name name name-js"}):
        for i in name.find_all("a", class_="product-block-name-link"):
            # print(i.get("title"))
            products_names.append(i.get("title"))
            # print(i.get("href"))
            products_links.append(i.get("href"))

# soup object for specific product
# product_soup = BeautifulSoup(html, "html.parser")

for jacket_url in products_links[:3]:
    browser = start_chrome(jacket_url, headless=True)
    html = browser.page_source
    kill_browser()
    product_soup = BeautifulSoup(html, "html.parser")
    price_info = product_soup.find_all("div", class_="product-content-info-price product-price product-price-js")
    for info in price_info:
        price = info.find("span", class_="product-content-info-offer-price offer-price offer-price-js product-price-amount-js")
        if price:
            products_prices.append(price.get_text(strip=True))


print(len(products_prices))
print(len(products_names))
print(len(products_links))
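One way to cut most of the overhead is to launch Chrome once and navigate the same session with Helium's go_to() instead of calling start_chrome()/kill_browser() for every product. A minimal sketch, reusing the selectors from the code above (the 1-second sleep is a guess; an explicit wait would be more robust):

from bs4 import BeautifulSoup
from helium import start_chrome, go_to, kill_browser
import time

def scrape_prices(products_links):
    """Visit each product page in one reused headless Chrome session and return the prices."""
    browser = start_chrome(headless=True)   # start Chrome once
    prices = []
    for jacket_url in products_links:
        go_to(jacket_url)                   # navigate the existing session instead of restarting Chrome
        time.sleep(1.0)                     # crude wait for the page to render
        product_soup = BeautifulSoup(browser.page_source, "html.parser")
        for info in product_soup.find_all("div", class_="product-content-info-price product-price product-price-js"):
            span = info.find("span", class_="product-content-info-offer-price offer-price offer-price-js product-price-amount-js")
            if span:
                prices.append(span.get_text(strip=True))
    kill_browser()                          # one browser start-up and tear-down in total
    return prices

With the lists collected above, products_prices = scrape_prices(products_links[:3]) should give the same result with a single browser launch instead of one per product.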
  • Check out 'headless-mode' browser (https://stackoverflow.com/questions/46920243/how-to-configure-chromedriver-to-initiate-chrome-browser-in-headless-mode-throug). – DaveIdito Nov 15 '20 at 10:10
  • @DaveIdito He is already using `headless` – Abhishek Rai Nov 15 '20 at 10:11
  • I don't think he is looking for a faster parser, which would be `lxml`. He is looking for a quicker way to get the full HTML, and I don't think there is one... So, I am following this question as well. I don't know if `urllib` is the answer. – Abhishek Rai Nov 15 '20 at 10:12
  • OP, it is not clear what your problem is: 1. you can't get the full HTML, or 2. it's too slow? – DaveIdito Nov 15 '20 at 10:50
  • @DaveIdito Sorry for that. I managed to get the full HTML and the price I need using Helium, but the issue, I guess, is that it's too slow to go into every product link, get its HTML, and then parse it with BeautifulSoup to get the info I need. – Manaf Albarghash Nov 15 '20 at 21:25
  • Have you checked 1) if there is a public API? 2) the network tab to see if price info comes from an xhr you can mimic? (Possibly when pressing load more as there is likely a POST request) – QHarr Nov 16 '20 at 06:10
  • You are right, and I have actually decided to change websites, because their site is filled with JavaScript and it isn't easy to pull the page source from. requests doesn't give you the JavaScript-rendered content, and even Helium didn't give me everything. My life is way easier with the new website that I picked. Note that I am doing this for a school project. I figured my previous code was correct, but slow as hell, because I am using Helium (which opens a Chrome or Firefox tab in the background every time). Thanks for taking the time to help, everyone :) – Manaf Albarghash Nov 17 '20 at 01:44
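
Following up on QHarr's comment: if the network tab shows that the product list or prices arrive via an XHR (for example when clicking "LOAD MORE"), that request can often be replayed with plain requests and no browser at all. The sketch below only shows the pattern; the endpoint, query parameters, and JSON field names are hypothetical placeholders that would have to be copied from the real request seen in the browser's dev tools:

import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"}

# Hypothetical endpoint and query parameters -- copy the real URL and payload
# from the request that fires when "LOAD MORE" is clicked (network tab).
url = "https://www.thenorthface.com/hypothetical/product-listing-endpoint"
params = {"beginIndex": 0, "pageSize": 48}

resp = requests.get(url, params=params, headers=headers, timeout=10)
resp.raise_for_status()
data = resp.json()

# The JSON structure is also a guess -- inspect the real response to find the
# fields holding each product's name, link, and price.
for product in data.get("products", []):
    print(product.get("name"), product.get("url"), product.get("price"))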
