
I need to do pagination for this page:

I read this question and tried this:

import time

scrolls = 10
while True:
    scrolls -= 1
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(3)
    if scrolls < 0:
        break

I need to scroll down to load all the products, but I don't know how many times I need to scroll to get them all.
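
I guess I could instead keep scrolling until document.body.scrollHeight stops changing, something like this sketch:

import time

# Scroll until the page height stops growing, instead of a fixed number of times
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(3)  # wait for more products to load
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # height stopped changing, assume we reached the end
    last_height = new_height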

I also tried using a very tall window

'SELENIUM_DRIVER_ARGUMENTS': ['--no-sandbox', '--window-size=1920,30000'],

and scroll down

time.sleep(10) 
self.driver.execute_script("window.scrollBy(0, 30000);")

Does someone have an idea how to get all the products? I'm open to another solution if Selenium is not the best fit for this case. Thanks.

UPDATE 1: I need all the product IDs. To get them I use this:

import re

products = response.css('div.jfJiHa > .iepIep')
for product in products:
    detail_link = product.css('a.jXwbaQ::attr("href")').get()
    product_id = re.findall(r'products/(\d+)', detail_link)[0]
parik
  • I think the key thing missing from your question is how you are extracting your data. I think the "seen" elements disappear as you scroll down, so you can't load everything on the page at once. – tomjn Apr 26 '21 at 15:39
  • @tomjn Yes, do you have an idea please? – parik Apr 27 '21 at 08:24
  • Yes, but you didn’t answer my question. How are you extracting items? – tomjn Apr 27 '21 at 08:48
  • @tomjn I updated my question and answered your question, thanks – parik Apr 27 '21 at 09:19
  • Thanks, but can you give us the entire `parse` function or whatever you've called it rather than small snippets so we can see how the `selenium` code interacts with the `Selector` code? The entire spider would be even better if it isn't too much – tomjn Apr 27 '21 at 09:21
  • I think it's not necessary here; the question is about displaying all the products. I have no problem with the parsing part – parik Apr 27 '21 at 09:56
  • I think we really need to see the interaction of `scrapy` and `selenium`. If you just have a single `response.css` call as above it isn't going to work because that just sees the response from `scrapy` – tomjn Apr 27 '21 at 09:59

4 Answers


Try scrolling down by the visible screen height each time, reading the presented products after each scroll, until //button[@data-test='footer-feedback-button'] (or any other element located at the bottom of the page) becomes visible.
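
For example, something like this (a rough sketch only, not written specifically for this page; the product-reading step is left as a placeholder):

import time
from selenium.common.exceptions import NoSuchElementException

viewport_height = driver.execute_script("return window.innerHeight")

while True:
    # ... read the currently presented products here ...
    driver.execute_script("window.scrollBy(0, arguments[0]);", viewport_height)
    time.sleep(1)  # let the next batch of products render
    try:
        footer = driver.find_element_by_xpath(
            "//button[@data-test='footer-feedback-button']")
    except NoSuchElementException:
        continue
    # Stop once the footer button has entered the viewport
    if driver.execute_script(
            "var r = arguments[0].getBoundingClientRect();"
            "return r.top < window.innerHeight && r.bottom > 0;", footer):
        break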

Prophet

This code may help -

from selenium import webdriver
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 30)

driver.get('https://www.compraonline.bonpreuesclat.cat/products/search?q=pasta')

# Collect the product card wrappers currently in the DOM
BaseDivs = driver.find_elements_by_xpath("//div[contains(@class,\"base__Wrapper\")]")

for div in BaseDivs:
    try:
        # Wait for the card's image to load, then scroll the card into view
        wait.until(EC.visibility_of_element_located((By.XPATH, "./descendant::img")))
        driver.execute_script("return arguments[0].scrollIntoView(true);", div)
    except StaleElementReferenceException:
        continue

This code waits for each image to load and then scrolls the element into view; this way it automatically works its way down to the end of the page.

Mark it as the answer if this is what you are looking for.

Swaroop Humane
  • The problem with this is that some elements in the middle of the page don't load. – parik Apr 25 '21 at 19:03
  • That's why you should use my solution :) – Prophet Apr 25 '21 at 19:08
  • Excuse me, I didn't find the difference between your solution and what I tried. You don't get an error because in your case you don't have a lot of products. If you try with this URL your solution won't work: https://www.compraonline.bonpreuesclat.cat/products/search?q=pasta – parik Apr 25 '21 at 19:11
  • I tried your solution, and I get just the first 30 products – parik Apr 25 '21 at 19:27
  • It should work! You have to scroll in a loop, one page height each time, until you get to the bottom. I saw the page you are working on. – Prophet Apr 25 '21 at 19:31
  • Do you get all the products? – parik Apr 25 '21 at 19:34
  • I didn't try to write it specifically for this page; however, I have done this several times on other, similar pages in the past. I have more than 6 years of experience writing automation. – Prophet Apr 25 '21 at 19:52
  • Again, you should scroll down by the visible screen height (or something like that) each time, not using document.body.scrollHeight. While paging down you can read the presented products, until you reach the bottom. – Prophet Apr 25 '21 at 19:54
  • @parik I have edited my answer, please try it and let me know if this is what you are looking for. Please mark it as the answer if it answers your query. – Swaroop Humane Apr 25 '21 at 20:46
  • Yup, now you changed your answer according to my comments here :) But if you are iterating through the list of products you should jump 6 elements in the list each time, to advance one row at a time – Prophet Apr 25 '21 at 20:51
  • I still get just the first 30 products – parik Apr 25 '21 at 21:26

As commented, without seeing your whole spider it is hard to see where you are going wrong here, but if we assume that your parsing is using the scrapy response then that is why you are always just getting 30 products.

You need to create a new selector from the driver after each scroll and query that. A full example of code that gets 300 items from the page is:

import re
import time
from pprint import pprint

import parsel
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver import Firefox

with Firefox() as driver:
    driver.get("https://www.compraonline.bonpreuesclat.cat/products/search?q=pasta")

    all_items = {}

    while True:
        sel = parsel.Selector(driver.page_source)
        for product in sel.css("div[data-test] h3 > a"):
            name = product.css("::text").get()
            product_id = re.search(r"(\d+)", product.attrib["href"]).group()
            all_items[product_id] = name
        try:
            element = driver.find_element_by_css_selector(
                "div[data-test] + div.iepIep:not([data-test])"
            )
        except NoSuchElementException:
            break
        driver.execute_script("arguments[0].scrollIntoView(true);", element)
        time.sleep(1)

    pprint(all_items)
    print("Number of items =", len(all_items))

The key bits of this:

  • After getting the page using driver.get we start looping
  • We create a new Selector (here I directly use parsel.Selector which is what scrapy uses internally)
  • We extract the info we need. Displayed products all have a data-test attribute. If this was a scrapy.Spider I'd yield the information, but here I just add it to a dictionary of all items.
  • After getting all the visible items, we try to find the first following sibling of a div with a data-test attribute that doesn't itself have a data-test attribute (using the CSS + combinator)
  • If no such element exists (because we have seen all items), we break out of the loop; otherwise we scroll that element into view and pause for a second
  • Repeat until all items have been parsed
tomjn
  • I don't use scrapy's Request; in my code response is parsel.Selector(driver.page_source). I ran your code and I had Number of items = 42, but on the site there are more than 300 items – parik Apr 27 '21 at 17:32
  • @parik as I've said above I get **300** items running the code (every time I've run it). Do you see Firefox scroll to the bottom of the page? – tomjn Apr 27 '21 at 17:53
  • Yes, you are right, it works, thank you very much – parik Apr 27 '21 at 22:16

I solved my problem, but not with Selenium. We can get all the products of the search with another request: https://www.compraonline.bonpreuesclat.cat/api/v4/products/search?limit=1000&offset=0&sort=favorite&term=pasta
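
For example (a minimal sketch; I'm assuming the endpoint returns JSON with a top-level products list, so the exact keys may differ):

import requests

url = ("https://www.compraonline.bonpreuesclat.cat/api/v4/products/search"
       "?limit=1000&offset=0&sort=favorite&term=pasta")
data = requests.get(url).json()

# Assumed response shape: a "products" list whose entries carry an id and name
for product in data.get("products", []):
    print(product.get("productId"), product.get("name"))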

parik