
I am new to the Selenium framework, and I must say it is an awesome library. I am trying to collect all links on a webpage that contain a particular id, "pagination", and separate them from the links that don't, because I want to iterate through all the pages behind each of those links.

for j in browser.find_elements(By.CSS_SELECTOR, "div#col-content > div.main-menu2.main-menu-gray strong a[href]"):
    print(j.get_property('href'))

The code above returns all the links, both with and without pagination.

Example links with pagination:

https://www.oddsportal.com/soccer/africa/africa-cup-of-nations-2015/results/
https://www.oddsportal.com/soccer/england/premier-league-2020-2021/results/
https://www.oddsportal.com/soccer/africa/africa-cup-of-nations-2021/results/
https://www.oddsportal.com/soccer/africa/africa-cup-of-nations-2019/results/

Example links without pagination:

https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/

In my code, I try to find out whether the given id exists on the page with `pagination = browser.find_element(By.ID, "pagination")`, but I stumble on an error. I understand the reason for the error: the id "pagination" does not exist on some of the pages.

no such element: Unable to locate element: {"method":"css selector","selector":"[id="pagination"]"}

I changed the above code to `pagination = browser.find_elements(By.ID, "pagination")`, which returns links both with and without pagination. So my question is: how can I get only the links that have a particular id from a list of links?
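(For reference: `find_elements` returns a plain Python list and never raises `NoSuchElementException`; an empty list is falsy, so the "does this id exist here" check can be sketched without a browser at all. The list literals below stand in for the return value of `browser.find_elements(By.ID, "pagination")`.)

```python
# Sketch: find_elements-style calls return a list; an empty list is falsy,
# so branching on it replaces the try/except around find_element.
def has_pagination(found_elements):
    # `found_elements` stands in for browser.find_elements(By.ID, "pagination")
    return bool(found_elements)

print(has_pagination([]))            # id absent on the page -> False
print(has_pagination(["<a ...>"]))   # id present            -> True
```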

from selenium.webdriver import Chrome, ChromeOptions
from selenium.webdriver.common.by import By
import time
import tqdm
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC



#define our URL
url = 'https://oddsportal.com/results/'
path = r'C:\Users\Glodaris\OneDrive\Desktop\Repo\Scraper\chromedriver.exe'
options = ChromeOptions()
options.headless = True

# options=options
browser = Chrome(executable_path=path, options=options)
browser.get(url)

title = browser.title
print('Title', title)


links = []
for i in browser.find_elements(By.CSS_SELECTOR, "div#archive-tables tbody tr[xsid='1'] td a[href]"):
    links.append(i.get_property('href'))

arr = []
condition = True
while condition:
    for link in (links):
        second_link = browser.get(link)
        for j in browser.find_elements(By.CSS_SELECTOR, "div#col-content > div.main-menu2.main-menu-gray strong a[href]"):
            browser.implicitly_wait(2)
            pagination = browser.find_element(By.ID, "pagination")
            if pagination:
                print(pagination.get_property('href'))
            else:
                print(j.get_property('href'))
    try:
        browser.find_elements("xpath", "//*[@id='pagination']/a[6]")
    except:
        condition = False

2 Answers


As you are using Selenium, you can actually click the pagination's forward button to navigate through the pages. The following example handles the cookie button, scrapes the data from the main table into a dataframe, and checks whether there is pagination: if not, it stops there. If there is, it navigates to the next page, gets the table data, and keeps going page by page until the table data on the current page is identical to that of the previous page, and then stops. It can handle any number of pages. The setup in the code below is for Linux; what you need to pay attention to is the imports, as well as the part after you define the browser/driver.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time as t
import pandas as pd

chrome_options = Options()
chrome_options.add_argument("--no-sandbox")


webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)

# url='https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/'
url = 'https://www.oddsportal.com/soccer/africa/africa-cup-of-nations-2021/results/'
browser.get(url)
try:
    WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.ID, "onetrust-reject-all-handler"))).click()
except Exception as e:
    print('no cookie button!')
games_table = WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "table[id='tournamentTable']")))
try:
    initial_games_table_data = games_table.get_attribute('outerHTML')
    dfs = pd.read_html(initial_games_table_data)
    print(dfs[0])
except Exception as e:
    print(e, 'Unfortunately, no matches can be displayed because there are no odds available from your selected bookmakers.')
while True:
    browser.execute_script("window.scrollTo(0,document.body.scrollHeight);")
    t.sleep(1)
    try:
        forward_button = WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.XPATH, "//div[@id='pagination']//span[text()='»']")))
        forward_button.click()  
    except Exception as e:
        print(e, 'no pagination, stopping here')
        break
    games_table = WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "table[id='tournamentTable']")))
    dfs = pd.read_html(games_table.get_attribute('outerHTML'))
    games_table_data = games_table.get_attribute('outerHTML')
    if games_table_data == initial_games_table_data:
        print('this is the last page')
        break
    print(dfs[0])
    initial_games_table_data = games_table_data
    print('went to next page')
    t.sleep(3)
Barry the Platipus

You are seeing the error message...

no such element: Unable to locate element: {"method":"css selector","selector":"[id="pagination"]"}

...because not all of the pages contain the element:

<div id="pagination">
    <a ...>
    <a ...>
    <a ...>
</div>

Solution

In these cases your best approach would be to wrap the code block in a try-except block as follows:

for j in browser.find_elements(By.CSS_SELECTOR, "div#col-content > div.main-menu2.main-menu-gray strong a[href]"):
    try:
        WebDriverWait(browser, 20).until(EC.visibility_of_element_located((By.ID, "pagination")))
        print([my_elem.get_attribute("href") for my_elem in WebDriverWait(browser, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div#pagination a[href*='page']")))])
    except:
        print("Pagination not available")
        continue

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

Update

A couple of things to note.

  • The `(By.ID, "pagination")` element doesn't have an `href` attribute, but several of its descendants do. So you may find conflicting results.


  • As you are using WebDriverWait, remember to remove all instances of implicitly_wait(), as mixing implicit and explicit waits can cause unpredictable wait times. For example, setting an implicit wait of 10 seconds and an explicit wait of 15 seconds could cause a timeout to occur after 20 seconds.
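A toy model of why the two timers stack (plain Python, not real Selenium calls; every name below is made up for illustration): each element lookup first burns the implicit wait before failing, and the explicit wait keeps retrying the lookup until its own deadline passes, so the total can exceed the explicit timeout alone.

```python
import time

IMPLICIT_WAIT = 2   # toy values, in seconds
EXPLICIT_WAIT = 3

def find_missing_element():
    # every lookup waits out the implicit wait before giving up
    time.sleep(IMPLICIT_WAIT)
    raise LookupError("no such element")

def explicit_wait(timeout, fn, poll=0.5):
    # simplified stand-in for WebDriverWait(...).until(...)
    deadline = time.monotonic() + timeout
    while True:
        try:
            return fn()
        except LookupError:
            if time.monotonic() > deadline:
                raise TimeoutError("explicit wait timed out")
        time.sleep(poll)

start = time.monotonic()
try:
    explicit_wait(EXPLICIT_WAIT, find_missing_element)
except TimeoutError:
    pass
elapsed = time.monotonic() - start
print(f"waited ~{elapsed:.1f}s for a {EXPLICIT_WAIT}s explicit timeout")
```

Here the missing element takes roughly 4.5 s to time out, even though the explicit timeout is only 3 s.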
undetected Selenium
  • I am getting `None` values on my terminal and it takes a lot of time to run because of the 20 sec wait time. I just updated the above script with the sample of the links I want to isolate from the rest because of the pagination on it, and the line on the script that generates all the links. – benjamin olise Jul 29 '22 at 23:07
  • `20` sec wait time is a hardcoded value for testing purpose. You can reduce it to 5,4,3,2 and even 1 sec. Ideally the wait timer for different activity should be part of the **Test Specification**. Checkout the answer update. I tried to explain about the `None` which is again about your _Test Steps_ – undetected Selenium Jul 29 '22 at 23:18
  • I selected `(By.ID, "pagination")` because the rest of the pages with pagination do not have such an attribute. I also tried using the next-button link `(By.CSS_SELECTOR, "#pagination > a:nth-last-child(2)")` to get links for only the pagination page, and it is taking a long time to display output on the terminal; in fact it doesn't display any output. – benjamin olise Jul 29 '22 at 23:25
  • _because the rest of the pages with pagination does not have such attribute_: My answer takes care of this requirement only :) I didn't followup on the locator strategy part. – undetected Selenium Jul 29 '22 at 23:30
  • Thanks for the prompt response. What I simply want to achieve is: given the above links, I want to check whether the pagination `(By.ID, "pagination")` attribute exists, and if so, navigate the pagination links. But the above solution does not solve that. I have tried different wait times but I am not getting a response. – benjamin olise Jul 29 '22 at 23:41
  • wait time isn't the issue here, you can keep a standard wait of 5 sec, that's enough across any website. Here your requirement and logic is the main question/concern. I have solved the first part. – undetected Selenium Jul 29 '22 at 23:47
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/246900/discussion-between-benjamin-olise-and-undetected-selenium). – benjamin olise Jul 29 '22 at 23:52
  • Let's discuss the issue in [Selenium](https://chat.stackoverflow.com/rooms/223360/selenium) room. – undetected Selenium Jul 29 '22 at 23:56
  • Checkout the update answer and let me know the status. – undetected Selenium Jul 30 '22 at 20:12