
I'm writing my first real scraper, and although in general it's been going well, I've hit a wall with Selenium: I can't get it to go to the next page.

Below is the head of my code. The output below this is just printing out data in the terminal for now, and that's all working fine. It just stops scraping at the end of page 1 and drops me back at my terminal prompt; it never starts on page 2. I would be so grateful if anyone could make a suggestion. I've tried selecting the button at the bottom of the page I'm trying to scrape using both the relative and the full XPath (you're seeing the full one here), but neither works. I'm trying to click the right-arrow button.

I built in my own error message to indicate whether the driver successfully found the element by XPath or not. The error message fires when I execute my code, so I guess it's not finding the element. I just can't understand why not.

# Importing libraries
import requests
import csv
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Import selenium 
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, WebDriverException
import time

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')
driver = webdriver.Chrome("/path/to/driver", options=options)
# Yes, I do have the actual path to my driver in the original code

driver.get("https://uk.eu-supply.com/ctm/supplier/publictenders?B=UK")
time.sleep(5)
while True:
    try:
        driver.find_element_by_xpath('/html/body/div[1]/div[3]/div/div/form/div[3]/div/div/ul[1]/li[4]/a').click()
    except (TimeoutException, WebDriverException) as e:
        print("A timeout or webdriver exception occurred.")
        break
driver.quit()

2 Answers


What you can do is set up Selenium expected conditions (visibility_of_element_located, element_to_be_clickable) and use a relative XPath to select the next-page element. Do all of this in a loop whose range is the number of pages you have to deal with.

XPath for the next page link:

//div[@class='pagination ctm-pagination']/ul[1]/li[last()-1]/a

Code could look like:

## imports

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get("https://uk.eu-supply.com/ctm/supplier/publictenders?B=UK")

## count the number of pages you have

els = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='pagination ctm-pagination']/ul[1]/li[last()]/a"))).get_attribute("data-current-page")

## loop. at the end of each iteration, click through to the following page

for i in range(int(els)):
    # scrape what you want here
    if i < int(els) - 1:  # the last page has no next page to click through to
        WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//div[@class='pagination ctm-pagination']/ul[1]/li[last()-1]/a"))).click()
  • Got this error when trying the above: `Traceback (most recent call last): File "eu-supply-scraper.py", line 51, in for i in range(int(els)): ValueError: invalid literal for int() with base 10: 'https://uk.eu-supply.com/ctm/supplier/2'` – MBWD Jul 30 '20 at 17:15
  • Also, can I ask where you got that XPath? It doesn't match the one I get when copying the next-page arrow's XPath from the Inspector. – MBWD Jul 30 '20 at 17:32
  • Code has been fixed. In your case, `els` should contain `2`. I've replaced `href` with `data-current-page` to get the number of pages. Regarding the XPaths: I wrote them myself. XPath is a very useful language to learn. :) – E.Wiest Aug 01 '20 at 02:40
  • Thanks very much. This works to click through to the second page. However, for some reason it scrapes the first page twice and doesn't scrape the second page (there are only two pages on this site at the moment). I'm printing to the terminal right now rather than saving to a file, and the output for page 1 just gets repeated, then it quits. I've been trying different things but can't get it to scrape page 2. Here's what my code looks like currently: https://pastebin.com/NvFrrrp5 – MBWD Aug 01 '20 at 11:35
  • Looks like I've got it sorted mate... I needed to wait longer after loading the second page. A thousand thanks! – MBWD Aug 01 '20 at 13:15
  • Great! Good catch on the waiting time. The code you posted on pastebin was private, so I couldn't investigate the issue. Keep up the good work. ;) – E.Wiest Aug 01 '20 at 16:00

You were pretty close with your while True and try/except logic. To go to the next page with Selenium, you have to induce WebDriverWait for element_to_be_clickable(), and you can use the following Locator Strategy:

  • Code Block:

    driver.get("https://uk.eu-supply.com/ctm/supplier/publictenders?B=UK")
    while True:
        try:
            next_link = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[contains(@class, 'state-active')]//following::li[1]/a[@href]")))
            next_link.click()
            print("Clicked for next page")
            # wait for the clicked link to go stale, i.e. for the next page to render
            WebDriverWait(driver, 10).until(EC.staleness_of(next_link))
        except TimeoutException:
            print("No more pages")
            break
    driver.quit()
    
  • Console Output:

    Clicked for next page
    No more pages
    
  • This worked insofar as it printed out "Clicked for next page", but then it prints "No more pages", which is your message for the timeout exception. The terminal still only shows the first page of data. – MBWD Jul 31 '20 at 08:45
  • @MBWD If I remember right, clicking on to the next page doesn't change the URL; only the current page number becomes `disabled` and fades out. It worked perfectly at my end. – undetected Selenium Jul 31 '20 at 08:48
  • Am I correctly passing the page data to Beautiful Soup after Selenium clicks the button? https://pastebin.com/ikNxj80G – MBWD Jul 31 '20 at 09:21
  • ATM, I'd avoid any comment with BS, as it needs further research :) – undetected Selenium Jul 31 '20 at 09:31
  • One more question for you... you gave the XPath (correctly) as //a[contains(@class, 'state-active')]//following::li[1]/a[@href], but that is not what I get when I copy the XPath of the next button's a link in the Inspector. Can I ask how you found that? – MBWD Jul 31 '20 at 20:50