requests.get() not getting the entire DOM

Question

I'm scraping a website, but some products don't appear in the DOM unless you scroll down. For instance, take a look at this page.

When I store the DOM inside a variable and try to get the divs corresponding to the products:

import requests
from bs4 import BeautifulSoup

req = requests.get(url, verify=False)  # url: address of the page being scraped
soup = BeautifulSoup(req.text, 'html.parser')
product_list = soup.find_all("div", class_="product-block")

product_list only contains 24 elements (instead of 91, the number of products on that page if you scroll down completely). How can I store the complete DOM inside req?

NB: I'm not sure that this is the reason the products are missing from product_list, but it is my interpretation: when I inspect the DOM with Firefox without scrolling down, I only see 24 <div class="product-block ...">, not 91.

The `requests` library loads HTML content with `JavaScript` turned off. You need to use a browser automation tool such as `selenium`. — gtlambert, Jan 15 '16 at 12:30

score 2 · Answer 1 · edited May 23 '17 at 12:04

The solution is very page-specific, but it should work. Inspecting the load process shows that, as soon as you scroll down, the browser performs an AJAX request to https://www.project6ny.com/collections/all-childrens-accessories?page=2. If you visit that URL, you will actually see the second page of the catalog.

Since you can determine the maximum number of pages (it is in the element , the penultimate element), you can apply the solution here to scrape the paginated catalog.
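A minimal sketch of that pagination loop follows, under a few assumptions: `?page=1` returns the first page, a page past the end contains no `product-block` divs, and the names `fetch_html`, `extract_product_blocks`, `scrape_paginated` and the `max_pages` cap are made up for illustration. Instead of reading the page count from the pagination element, this version simply stops at the first empty page.

```python
def fetch_html(url):
    """Download one page and return its HTML (needs the requests package)."""
    import requests  # imported here so the loop below can be tried offline
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def extract_product_blocks(html):
    """Return the product <div> elements found in one page of HTML."""
    from bs4 import BeautifulSoup  # same parser setup as in the question
    soup = BeautifulSoup(html, "html.parser")
    return soup.find_all("div", class_="product-block")


def scrape_paginated(base_url, fetch, extract, max_pages=50):
    """Collect items from base_url?page=1, ?page=2, ... until a page is empty.

    max_pages is a hypothetical safety cap, not a value taken from the site.
    """
    items = []
    for page in range(1, max_pages + 1):
        found = extract(fetch(f"{base_url}?page={page}"))
        if not found:  # assumption: an empty page means we ran past the end
            break
        items.extend(found)
    return items
```

Calling `scrape_paginated("https://www.project6ny.com/collections/all-childrens-accessories", fetch_html, extract_product_blocks)` should then return the product divs from every page rather than only the first batch, provided the site behaves as assumed above. Passing `fetch` and `extract` as arguments keeps the loop independent of the network, so it can be checked against canned HTML first.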
