
I'm trying to create a basic web scraper for Amazon results. While iterating through the results, I sometimes get to page 5 (sometimes only page 2) and then a StaleElementReferenceException is thrown. When I look at the browser after the exception is thrown, I can see that the driver did not scroll down to the bottom bar where the page numbers are.

My code:

driver.get('https://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Daps&field-keywords=sonicare+toothbrush')

for page in range(1,last_page_number +1):

    driver.implicitly_wait(10)

    bottom_bar = driver.find_element_by_class_name('pagnCur')
    driver.execute_script("arguments[0].scrollIntoView(true);", bottom_bar)

    current_page_number = int(driver.find_element_by_class_name('pagnCur').text)

    if page == current_page_number:
        next_page = driver.find_element_by_xpath('//div[@id="pagn"]/span[@class="pagnLink"]/a[text()="{0}"]'.format(current_page_number+1))
        next_page.click()
        print('page #',page,': going to next page')
    else:
        print('page #: ', page,'error')

I've looked at this question, and I'm guessing that a similar fix can be applied, but I'm not sure how to find something on the page that disappears. Also, based on how quickly the print statements are occurring, I can see that the implicitly_wait(10) isn't actually waiting a full 10 seconds.

The exception is pointing to the line that starts with "driver.execute_script". This is the exception:

StaleElementReferenceException: Message: The element reference of <span class="pagnCur"> is stale; either the element is no longer attached to the DOM, it is not in the current frame context, or the document has been refreshed

Sometimes I'll get a ValueError:

ValueError: invalid literal for int() with base 10: ''

So these errors/exceptions lead me to believe that there is something going on with waiting for the page to refresh completely.

Mariah Akinbi

2 Answers

3

If you just want your script to iterate over all the result pages, you don't need any complicated logic: just keep clicking the Next button for as long as it is clickable:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome()

driver.get('https://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Daps&field-keywords=sonicare+toothbrush')

while True:
    try:
        wait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'a > span#pagnNextString'))).click()
    except TimeoutException:
        break

P.S. Note that implicitly_wait(10) does not pause for a full 10 seconds; it waits up to 10 seconds for the element to appear in the HTML DOM. If the element is found within 1 or 2 seconds, the wait ends and the remaining 8-9 seconds are skipped.
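
The "up to 10 seconds" behaviour can be modelled with a small polling loop. This is only a sketch of the idea, not Selenium's actual implementation; `poll_until`, `fake_lookup`, and the interval value are hypothetical names and numbers chosen for illustration.

```python
import time


def poll_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly; return its first truthy result.

    Raises TimeoutError if nothing truthy is returned before `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # found: stop waiting immediately
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within {} s".format(timeout))
        time.sleep(interval)


# Hypothetical stand-in for an element lookup that succeeds on the third try.
calls = {"n": 0}


def fake_lookup():
    calls["n"] += 1
    return "element" if calls["n"] >= 3 else None


start = time.monotonic()
found = poll_until(fake_lookup, timeout=10.0, interval=0.1)
elapsed = time.monotonic() - start
print(found, "found after", calls["n"], "lookups")
```

The timeout is an upper bound: the loop returns on the first successful lookup, which is why the print statements in the question appear much sooner than 10 seconds apart.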

Andersson
  • 1
    Cleanest approach as usual. – SIM Dec 06 '18 at 07:37
  • @andersson this worked beautifully! Thank you! how did you know that 'a > span#pagnNextString' is the appropriate css selector? When I inspect the next button and copy the css selector it shows up as '#pagnNextString'. Also, thank you for explaining implicitly_wait()! – Mariah Akinbi Dec 06 '18 at 08:38
  • 1
    @MariahAkinbi, note that on the last page the Next button (the span with `id="pagnNextString"`) is not a child of an anchor (`a`), but Selenium (for some reason) still "thinks" it is clickable. So to break the loop on the last iteration we have to specify explicitly that we need *a link with a `"pagnNextString"` child, not just the `"pagnNextString"` element*. – Andersson Dec 06 '18 at 08:43
3

This error message...

StaleElementReferenceException: Message: The element reference of <span class="pagnCur"> is stale; either the element is no longer attached to the DOM, it is not in the current frame context, or the document has been refreshed

...implies that the previously obtained reference to the element is now stale: the element it pointed to is no longer present in the page's DOM.

The common reasons behind this issue are:

  • The element has changed position within the HTML.
  • The element is no longer attached to the DOM tree.
  • The webpage the element was part of has been refreshed.
  • The previous instance of the element has been replaced by JavaScript or an Ajax call.
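
In all of these cases the usual remedy is to locate the element again rather than reuse the old reference. A minimal sketch of that idea follows; `retry_on_stale`, `FakeStaleError`, and `read_page_number` are hypothetical names, and the fake exception stands in for Selenium's `StaleElementReferenceException` so the example runs without a browser.

```python
def retry_on_stale(action, stale_exc, attempts=3):
    """Run action(); if it raises stale_exc, run it again (up to `attempts` times).

    `action` must look the element up from scratch on every call, otherwise
    the retry would just reuse the same stale reference.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except stale_exc:
            if attempt == attempts:
                raise  # give up after the last attempt


class FakeStaleError(Exception):
    """Stand-in for selenium.common.exceptions.StaleElementReferenceException."""


attempts_seen = []


def read_page_number():
    # Hypothetical lookup: the first call hits a page that is mid-refresh.
    attempts_seen.append(1)
    if len(attempts_seen) < 2:
        raise FakeStaleError("element reference is stale")
    return "2"


page = retry_on_stale(read_page_number, FakeStaleError)
print("current page:", page)
```

With a real driver, `action` would be a function that finds the `pagnCur` element and reads its text in one step, and `stale_exc` would be `StaleElementReferenceException`.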

This use case

Keeping your approach of scrolling with scrollIntoView() and printing a couple of helpful debug messages, I have made some minor adjustments that add WebDriverWait, and you can use the following solution:

  • Code Block:

    from selenium import webdriver
    from selenium.common.exceptions import WebDriverException
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    options = Options()
    options.add_argument("start-maximized")
    options.add_argument('disable-infobars')
    options.add_argument("--disable-extensions")
    # Newer Selenium 4 releases no longer accept chrome_options= and executable_path=;
    # there, pass options=options and a Service object instead.
    driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
    driver.get("https://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Daps&field-keywords=sonicare+toothbrush")
    current_page_number = "unknown"  # defined up front so the except branch can always print it
    while True:
        try:
            # Re-locate the page indicator on every iteration so the reference is never stale
            current_page_number_element = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "span.pagnCur")))
            driver.execute_script("arguments[0].scrollIntoView(true);", current_page_number_element)
            current_page_number = current_page_number_element.get_attribute("innerHTML")
            # Wait until the Next arrow can be clicked, then click it
            WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "span.pagnNextArrow"))).click()
            print("page # {} : going to next page".format(current_page_number))
        except WebDriverException:
            # Covers TimeoutException (no clickable Next arrow) and other driver errors
            print("page # {} : error, no more pages".format(current_page_number))
            break
    driver.quit()
    
  • Console Output:

    page # 1 : going to next page
    page # 2 : going to next page
    page # 3 : going to next page
    page # 4 : going to next page
    page # 5 : going to next page
    page # 6 : going to next page
    page # 7 : going to next page
    page # 8 : going to next page
    page # 9 : going to next page
    page # 10 : going to next page
    page # 11 : going to next page
    page # 12 : going to next page
    page # 13 : going to next page
    page # 14 : going to next page
    page # 15 : going to next page
    page # 16 : going to next page
    page # 17 : going to next page
    page # 18 : going to next page
    page # 19 : going to next page
    page # 20 : error, no more pages
    
undetected Selenium
  • 1
    this works great!!! Thank you! What is the purpose of the second WebDriverWait line? – Mariah Akinbi Dec 06 '18 at 08:40
  • 1
    @MariahAkinbi The first `WebDriverWait` waits for the _current_page_number_element_ to be **visible** before we attempt to scroll. Once we have scrolled, the second `WebDriverWait` waits for `element_to_be_clickable`, so that the solution works reliably across platforms. – undetected Selenium Dec 06 '18 at 08:44
  • 1
    okay, makes sense! If the element is visible, doesn't that mean it's clickable? Or I could skip the visible wait and only use the clickable wait - because all that matters is if it's clickable? – Mariah Akinbi Dec 06 '18 at 08:49
  • 1
    No, an element being **visible** doesn't guarantee that it's **clickable**. If you are not clicking, the _visible wait_ is sufficient, but before you attempt a click the _click wait_ is needed to make your program work reliably across platforms. – undetected Selenium Dec 06 '18 at 08:58
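
The difference between the two waits can be shown with a small model. Selenium documents `element_to_be_clickable` as checking that an element is visible and enabled; the sketch below mimics that with a hypothetical `FakeElement` class so it runs without a browser, and `is_visible` and `is_clickable` are illustrative helpers, not Selenium functions.

```python
class FakeElement:
    """Hypothetical stand-in exposing the two WebElement checks used below."""

    def __init__(self, displayed, enabled):
        self._displayed = displayed
        self._enabled = enabled

    def is_displayed(self):
        return self._displayed

    def is_enabled(self):
        return self._enabled


def is_visible(element):
    # Visibility wait: only asks whether the element is displayed.
    return element.is_displayed()


def is_clickable(element):
    # Clickable wait: displayed AND enabled.
    return element.is_displayed() and element.is_enabled()


disabled_button = FakeElement(displayed=True, enabled=False)
normal_link = FakeElement(displayed=True, enabled=True)

print(is_visible(disabled_button), is_clickable(disabled_button))
print(is_visible(normal_link), is_clickable(normal_link))
```

A displayed but disabled element passes the visibility check and fails the clickable check, which is why the click is guarded by the second wait.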