How to wait for a page to fully load using requests_html

Question

When I access https://www.dickssportinggoods.com/f/tents-accessories?pageNumber=2 with requests_html, I need to wait some time before the page actually loads. Is that possible with requests_html? My code:

from requests_html import HTMLSession
from bs4 import BeautifulSoup
from lxml import etree

s = HTMLSession()
response = s.get(
    'https://www.dickssportinggoods.com/f/tents-accessories?pageNumber=2')
response.html.render()


soup = BeautifulSoup(response.content, "html.parser")
dom = etree.HTML(str(soup))
item = dom.xpath('//a[@class="rs_product_description d-block"]/text()')[0]
print(item)

Already answered: https://stackoverflow.com/questions/60416507/python-requests-not-getting-full-page – wuzz Sep 12 '21 at 10:06
That answer says to use `r.html.render()` and I am already doing that. – Ibtsam Ch Sep 12 '21 at 10:11
@Ibtsam Ch Run `pip install requests-html`, then use `from requests_html import HTMLSession` and `from requests_html import AsyncHTMLSession`. – wuzz Sep 12 '21 at 10:16
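
For reference, `render()` itself accepts a `sleep` argument (a pause after the initial render) and a `timeout` argument, which is the closest requests_html comes to "waiting" for a page. The sketch below is a minimal illustration, not a confirmed fix for this site: the function name and default values are assumptions, the selector is the one from the question, and whether the listing renders in headless Chromium at all has not been verified. Note that it queries `response.html` after rendering, because `response.content` still holds the HTML as originally downloaded.

```python
def fetch_product_names(url, sleep_seconds=3, timeout_seconds=30):
    """Render the page in headless Chromium and return the product link texts.

    The function name and default values are illustrative assumptions.
    """
    # Imported here so the module still loads where requests_html is missing
    from requests_html import HTMLSession

    session = HTMLSession()
    response = session.get(url)
    # sleep: seconds to pause after the initial render so page scripts can run
    # timeout: seconds before the render attempt is abandoned
    response.html.render(sleep=sleep_seconds, timeout=timeout_seconds)
    # Query the rendered DOM; response.content is still the pre-render HTML
    return response.html.xpath(
        '//a[@class="rs_product_description d-block"]/text()'
    )
```

Calling `fetch_product_names('https://www.dickssportinggoods.com/f/tents-accessories?pageNumber=2')` returns a list of strings. An empty list suggests the pause was too short or that the content is not reachable this way. The first `render()` call also downloads Chromium, so it can take a while.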

score 0 · Answer 1 · answered Sep 12 '21 at 10:04

You can also use Selenium in headless mode.

Selenium can wait until elements are found by using explicit waits.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

driver_path = "/path/to/chromedriver"  # placeholder: set this to your chromedriver location

options = webdriver.ChromeOptions()
options.add_argument('--window-size=1920,1080')
options.add_argument("--headless")
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(driver_path), options=options)
driver.get("URL here")
# Block for up to 20 seconds until the product link is visible
wait = WebDriverWait(driver, 20)
wait.until(EC.visibility_of_element_located((By.XPATH, "//a[@class='rs_product_description d-block']")))

PS: You'd have to download chromedriver from here

answered Sep 12 '21 at 10:04

cruisepandey

Yes, but I want to avoid Selenium. Isn't there any other way? – Ibtsam Ch Sep 12 '21 at 10:05
What is the reason you do not want to use Selenium? – cruisepandey Sep 12 '21 at 10:07
Because it's slow and inconsistent too. Many times it works fine, but as soon as I add the headless argument it stops working. – Ibtsam Ch Sep 12 '21 at 10:12
Selenium is just bad. OP is right. It's heavy, especially if you want to perform a simple task. Why force people to use Selenium? It's un-deployable and problematic. – greendino Apr 24 '22 at 13:04

balderman · Accepted Answer · 2021-09-12T10:38:45.640

It looks like the data you are looking for can be fetched using HTTP GET to
https://prod-catalog-product-api.dickssportinggoods.com/v2/search?searchVO=%7B%22selectedCategory%22%3A%2212301_1809051%22%2C%22selectedStore%22%3A%220%22%2C%22selectedSort%22%3A1%2C%22selectedFilters%22%3A%7B%7D%2C%22storeId%22%3A15108%2C%22pageNumber%22%3A2%2C%22pageSize%22%3A48%2C%22totalCount%22%3A112%2C%22searchTypes%22%3A%5B%22PINNING%22%5D%2C%22isFamilyPage%22%3Atrue%2C%22appliedSeoFilters%22%3Afalse%2C%22snbAudience%22%3A%22%22%2C%22zipcode%22%3A%22%22%7D

The call returns JSON, which you can use directly with zero scraping code.

Copy and paste the URL into the browser to see the data.

You can specify the page number in the URL:

searchVO={"selectedCategory":"12301_1809051","selectedStore":"0","selectedSort":1,"selectedFilters":{},"storeId":15108,"pageNumber":2,"pageSize":48,"totalCount":112,"searchTypes":["PINNING"],"isFamilyPage":true,"appliedSeoFilters":false,"snbAudience":"","zipcode":""}

Working code below:

import requests
import pprint

# Page to request; it is substituted into the pageNumber field of the encoded searchVO
page_num = 2
url = f'https://prod-catalog-product-api.dickssportinggoods.com/v2/search?searchVO=%7B%22selectedCategory%22%3A%2212301_1809051%22%2C%22selectedStore%22%3A%220%22%2C%22selectedSort%22%3A1%2C%22selectedFilters%22%3A%7B%7D%2C%22storeId%22%3A15108%2C%22pageNumber%22%3A{page_num}%2C%22pageSize%22%3A48%2C%22totalCount%22%3A112%2C%22searchTypes%22%3A%5B%22PINNING%22%5D%2C%22isFamilyPage%22%3Atrue%2C%22appliedSeoFilters%22%3Afalse%2C%22snbAudience%22%3A%22%22%2C%22zipcode%22%3A%22%22%7D'

r = requests.get(url)
if r.status_code == 200:
    pprint.pprint(r.json())
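
If hand-editing the percent-encoded URL feels fragile, the same URL can be assembled from the decoded `searchVO` dict shown earlier. This is only a sketch: the helper name `build_search_url` and its parameters are my own, the field values are copied from the request above, and nothing here is documented by the site, so the API may change or reject other values.

```python
import json
from urllib.parse import quote

BASE_URL = "https://prod-catalog-product-api.dickssportinggoods.com/v2/search"


def build_search_url(page_number, category="12301_1809051", page_size=48):
    """Build the search URL for one page (hypothetical helper, not a site API)."""
    search_vo = {
        "selectedCategory": category,
        "selectedStore": "0",
        "selectedSort": 1,
        "selectedFilters": {},
        "storeId": 15108,
        "pageNumber": page_number,
        "pageSize": page_size,
        "totalCount": 112,
        "searchTypes": ["PINNING"],
        "isFamilyPage": True,
        "appliedSeoFilters": False,
        "snbAudience": "",
        "zipcode": "",
    }
    # Compact separators keep the JSON free of spaces, as in the captured request
    as_json = json.dumps(search_vo, separators=(",", ":"))
    # safe="" percent-encodes every reserved character, including ":" and ","
    return f"{BASE_URL}?searchVO={quote(as_json, safe='')}"
```

`build_search_url(3)` can then be passed to `requests.get` in place of the hard-coded `url`.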

edited Sep 12 '21 at 10:38

answered Sep 12 '21 at 10:18

balderman

Yes, but in this request I can't specify the page number. – Ibtsam Ch Sep 12 '21 at 10:24
@IbtsamCh - the answer was updated with working code that lets you specify the page number. Enjoy :-) – balderman Sep 12 '21 at 10:39
Hey, sorry, but I want to change the selected category too, for the categories available on https://www.dickssportinggoods.com/c/camping-hiking-gear. Is there any way to find those category numbers? I can't find them anywhere. – Ibtsam Ch Sep 12 '21 at 11:20
Look at the `searchVO` dict in my answer. There you can find the available arguments. – balderman Sep 12 '21 at 11:24
Yes, we can definitely change it, but I can't find the category numbers for other categories anywhere except in the API URL. For example, this category has "selectedCategory":"12301_1809051", and this value changes for other categories. – Ibtsam Ch Sep 12 '21 at 11:26
So your question is "how do I know other category IDs?" If so, here is what you should do. Browse the website and "play" with other categories. At the same time, use dev tools (F12) -> Network -> XHR and look for the HTTP request that starts with `https://prod-catalog-product-api.dickssportinggoods.com/v2/search`. There you will find it. – balderman Sep 12 '21 at 11:29
Yes, I can definitely find it that way, but I wanted to automate it instead of finding the category IDs for every category by hand. – Ibtsam Ch Sep 12 '21 at 11:31
Why don't you prepare a dict that looks like `{'cat1': 12,'cat6':234}` and use it at run time? (12 and 234 are the IDs in this example.) – balderman Sep 12 '21 at 11:32
Sadly, this only answers the specific website the OP is facing. – greendino Apr 24 '22 at 13:06
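
The dict suggested in the comments can be filled from search URLs copied out of the dev tools Network tab. The following is a rough sketch under stated assumptions: the helper name is made up, and the captured URL is a trimmed version of the one in the answer (only two `searchVO` fields kept for readability). It does not discover categories automatically; it only decodes URLs you have already captured.

```python
import json
from urllib.parse import urlparse, parse_qs


def category_id_from_search_url(url):
    """Return the selectedCategory value embedded in a captured search URL."""
    query = parse_qs(urlparse(url).query)
    # parse_qs percent-decodes the value, leaving plain JSON text
    search_vo = json.loads(query["searchVO"][0])
    return search_vo["selectedCategory"]


# Hypothetical capture: one URL per category, copied from the XHR list
captured = {
    "tents-accessories": (
        "https://prod-catalog-product-api.dickssportinggoods.com/v2/search"
        "?searchVO=%7B%22selectedCategory%22%3A%2212301_1809051%22"
        "%2C%22pageNumber%22%3A2%7D"
    ),
}

category_ids = {
    name: category_id_from_search_url(url) for name, url in captured.items()
}
print(category_ids)  # {'tents-accessories': '12301_1809051'}
```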
