When accessing https://www.dickssportinggoods.com/f/tents-accessories?pageNumber=2 with requests_html, I need to wait some time before the page actually loads. Is that possible with requests_html? My code:

from requests_html import HTMLSession
from bs4 import BeautifulSoup
from lxml import etree

s = HTMLSession()
response = s.get(
    'https://www.dickssportinggoods.com/f/tents-accessories?pageNumber=2')
response.html.render()


soup = BeautifulSoup(response.content, "html.parser")
dom = etree.HTML(str(soup))
item = dom.xpath('//a[@class="rs_product_description d-block"]/text()')[0]
print(item)

Ibtsam Ch

2 Answers

You can also use Selenium in headless mode.

Selenium can wait until elements are found by using explicit waits.

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument('--window-size=1920,1080')
options.add_argument("--headless")
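# driver_path must point to your local chromedriver executable.
# (Selenium 4 deprecates executable_path in favor of a Service object.)
driver = webdriver.Chrome(executable_path=driver_path, options=options)
driver.get("URL here")
# Wait up to 20 seconds for the product link to become visible.
wait = WebDriverWait(driver, 20)
wait.until(EC.visibility_of_element_located((By.XPATH, "//a[@class='rs_product_description d-block']")))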

PS: You'd have to download chromedriver from here
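
To make the idea of an explicit wait concrete: it is essentially a polling loop that keeps calling a condition until it returns something truthy or a timeout expires. The sketch below shows that loop using only the standard library. It is not Selenium's implementation; `wait_until`, `poll_interval`, and the fake `element_is_present` check are hypothetical names used for illustration.

```python
import time


def wait_until(condition, timeout=20.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Illustrative only: mimics the idea behind an explicit wait,
    not Selenium's actual code.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Demo: a fake "page" whose element only shows up on the third check.
checks = {"count": 0}


def element_is_present():
    checks["count"] += 1
    return "tent" if checks["count"] >= 3 else None


found = wait_until(element_is_present, timeout=5.0, poll_interval=0.01)
print(found, checks["count"])
```

With Selenium, the condition would be something like `EC.visibility_of_element_located(...)`, and `wait.until(...)` returns the located element once the condition holds, so you can read its text directly from the return value.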

cruisepandey

It looks like the data you are looking for can be fetched using HTTP GET to
https://prod-catalog-product-api.dickssportinggoods.com/v2/search?searchVO=%7B%22selectedCategory%22%3A%2212301_1809051%22%2C%22selectedStore%22%3A%220%22%2C%22selectedSort%22%3A1%2C%22selectedFilters%22%3A%7B%7D%2C%22storeId%22%3A15108%2C%22pageNumber%22%3A2%2C%22pageSize%22%3A48%2C%22totalCount%22%3A112%2C%22searchTypes%22%3A%5B%22PINNING%22%5D%2C%22isFamilyPage%22%3Atrue%2C%22appliedSeoFilters%22%3Afalse%2C%22snbAudience%22%3A%22%22%2C%22zipcode%22%3A%22%22%7D

The call returns JSON, which you can use directly with zero scraping code.

Copy and paste the URL into a browser to see the data.

You can specify the page number in the url:

searchVO={"selectedCategory":"12301_1809051","selectedStore":"0","selectedSort":1,"selectedFilters":{},"storeId":15108,"pageNumber":2,"pageSize":48,"totalCount":112,"searchTypes":["PINNING"],"isFamilyPage":true,"appliedSeoFilters":false,"snbAudience":"","zipcode":""}

Working code below:

import requests
import pprint

page_num = 2
# page_num is substituted into the percent-encoded "pageNumber" field of searchVO.
url = f'https://prod-catalog-product-api.dickssportinggoods.com/v2/search?searchVO=%7B%22selectedCategory%22%3A%2212301_1809051%22%2C%22selectedStore%22%3A%220%22%2C%22selectedSort%22%3A1%2C%22selectedFilters%22%3A%7B%7D%2C%22storeId%22%3A15108%2C%22pageNumber%22%3A{page_num}%2C%22pageSize%22%3A48%2C%22totalCount%22%3A112%2C%22searchTypes%22%3A%5B%22PINNING%22%5D%2C%22isFamilyPage%22%3Atrue%2C%22appliedSeoFilters%22%3Afalse%2C%22snbAudience%22%3A%22%22%2C%22zipcode%22%3A%22%22%7D'

r = requests.get(url)
if r.status_code == 200:
    pprint.pprint(r.json())
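
Editing a percent-encoded URL by hand is error-prone, so one alternative is to build `searchVO` as a Python dict and encode it. The sketch below does only the URL construction and makes no network call. The field values are copied from the `searchVO` shown above; the function name `build_search_url` is hypothetical, and whether the API still accepts these exact fields is an assumption.

```python
import json
from urllib.parse import quote

API_URL = "https://prod-catalog-product-api.dickssportinggoods.com/v2/search"


def build_search_url(page_number, category="12301_1809051"):
    """Build the search URL for one page (hypothetical helper)."""
    search_vo = {
        "selectedCategory": category,
        "selectedStore": "0",
        "selectedSort": 1,
        "selectedFilters": {},
        "storeId": 15108,
        "pageNumber": page_number,
        "pageSize": 48,
        "totalCount": 112,
        "searchTypes": ["PINNING"],
        "isFamilyPage": True,
        "appliedSeoFilters": False,
        "snbAudience": "",
        "zipcode": "",
    }
    # Compact separators reproduce the JSON layout used in the URL above;
    # safe="" percent-encodes every reserved character, including ":" and ",".
    encoded = quote(json.dumps(search_vo, separators=(",", ":")), safe="")
    return f"{API_URL}?searchVO={encoded}"


url = build_search_url(3)
print(url)
```

The resulting `url` can be passed to `requests.get` exactly like the hand-written one, and looping over `page_number` gives you every page.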
balderman
  • Yes, but in this request I can't specify the page number – Ibtsam Ch Sep 12 '21 at 10:24
  • @IbtsamCh - the answer was updated with working code that lets you specify the page number. Enjoy :-) – balderman Sep 12 '21 at 10:39
  • Hey, sorry, but I also want to change the selected category, for the categories available on https://www.dickssportinggoods.com/c/camping-hiking-gear. Is there any way to find those category numbers? I can't find them anywhere – Ibtsam Ch Sep 12 '21 at 11:20
  • Look at the `searchVO` dict in my answer. There you can find the available arguments. – balderman Sep 12 '21 at 11:24
  • Yes, we can definitely change it, but I can't find the category numbers for other categories anywhere except in the API URL. For example, this category has "selectedCategory":"12301_1809051", and this value changes for other categories – Ibtsam Ch Sep 12 '21 at 11:26
  • So your question is "how do I know other cat. ids?" - if the answer is yes, here is what you should do. Browse the website and "play" with other categories. At the same time use dev tools (F12) -> Network -> XHR and look for the HTTP request that starts with `https://prod-catalog-product-api.dickssportinggoods.com/v2/search` - there you will find it. – balderman Sep 12 '21 at 11:29
  • Yes, I can definitely find them that way, but I wanted to automate it instead of finding the cat ids for every category by hand – Ibtsam Ch Sep 12 '21 at 11:31
  • Why don't you prepare a dict that looks like `{'cat1': 12,'cat6':234}` and use it at run time? (12 & 234 are the ids in this example) – balderman Sep 12 '21 at 11:32
  • Sadly, this only answers for the specific website the OP is facing – greendino Apr 24 '22 at 13:06
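
The dict suggested in the comments can be filled in semi-automatically: once you have copied a search URL from the DevTools Network tab, the category id can be read out of it in code. The sketch below assumes the captured URL carries a `searchVO` query parameter as in the answer. The function name `category_id_from_url` and the `captured` dict are hypothetical, and the sample URL is a shortened form of the one in the answer.

```python
import json
from urllib.parse import urlsplit, parse_qs


def category_id_from_url(api_url):
    """Return the selectedCategory value embedded in a captured search URL."""
    query = parse_qs(urlsplit(api_url).query)  # also undoes the percent-encoding
    search_vo = json.loads(query["searchVO"][0])
    return search_vo["selectedCategory"]


# URLs copied by hand from DevTools, keyed by a label of your choosing.
# Shortened here to two searchVO fields for readability.
captured = {
    "tents-accessories": (
        "https://prod-catalog-product-api.dickssportinggoods.com/v2/search"
        "?searchVO=%7B%22selectedCategory%22%3A%2212301_1809051%22%2C"
        "%22pageNumber%22%3A2%7D"
    ),
}

# Lookup table of category label -> category id, usable at run time.
category_ids = {name: category_id_from_url(u) for name, u in captured.items()}
print(category_ids)
```

This still requires visiting each category once to capture its URL, but the ids themselves no longer have to be picked out of the encoded string by eye.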