Find href on web page

Question

I don't understand why the following will not work - I am looking for and trying to click this specific link:

<a href="#/documents/2077">

From the URL: https://species-registry.canada.ca/index-en.html#/documents?documentTypeId=18&sortBy=documentTypeSort&sortDirection=asc&pageSize=10&keywords=Victoria%27s%20Owl-clover

From a starting point of that URL I have tried a few things including the following:

Attempt #1

WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.PARTIAL_LINK_TEXT,"COSEWIC-Assessment-and-status-report")))

and

appraisal_html = driver.find_element_by_partial_link_text("COSEWIC-Assessment-and-status-report")

Attempt #2

soup = bs(req.text, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))

Among other things. Keep in mind that this is a generalized search: the species name will change every time I run it, but everything else should stay the same.

The second attempt is straight from the Beautiful Soup documentation. It finds a whole bunch of links, such as the ones under the menu tab, but not the href I am looking for.

The first attempt for some reason just times out without finding the partial text I input. Maybe this is because that is the text on the page and not the href itself?

One solution I am now thinking of is to first look for the bounding box that contains the link and then look for the link within that smaller search area, but I still don't know why I am unable to find the right link from the entire page.

undetected Selenium · Accepted Answer · 2021-12-10T06:48:08.550

A couple of things here:

COSEWIC-Assessment-and-status-report isn't the exact link text; the actual text is COSEWIC Assessment and Status Report on the Victoria’s Owl-clover

The text is not within the A tag but within a SPAN:

<span data-v-7ee3c58f="" class="name-primary">COSEWIC Assessment and Status Report on the Victoria’s Owl-clover <em>Castilleja victoriae</em> in Canada</span>

So to identify the clickable element, you need to induce WebDriverWait for element_to_be_clickable(), and you can use the following locator strategy:

Using XPATH:

driver.get("https://species-registry.canada.ca/index-en.html#/documents?documentTypeId=18&sortBy=documentTypeSort&sortDirection=asc&pageSize=10&keywords=Victoria%27s%20Owl-clover")
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH,"//span[contains(., 'COSEWIC Assessment and Status Report on the Victoria’s Owl-clover')]"))).click()

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
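
Since the species name changes from search to search, the XPath string can be built from the name instead of being hard-coded. The following is a minimal sketch, assuming every report title starts with the same "COSEWIC Assessment and Status Report on the" prefix seen here (not verified for other species). `xpath_literal` and `report_xpath` are hypothetical helper names, not part of Selenium. XPath 1.0 has no escape character for quotes, so the helper picks whichever quoting style is safe for the given text:

```python
def xpath_literal(text: str) -> str:
    """Return text as an XPath 1.0 string literal, whatever quotes it contains."""
    if "'" not in text:
        return f"'{text}'"
    if '"' not in text:
        return f'"{text}"'
    # Both quote types are present: split on single quotes and glue the
    # pieces back together with concat(), quoting each apostrophe with "'".
    separator = ", \"'\", "
    pieces = [f"'{part}'" for part in text.split("'")]
    return "concat(" + separator.join(pieces) + ")"


def report_xpath(species_name: str) -> str:
    """Build the span locator for a species (title prefix is an assumption)."""
    title = f"COSEWIC Assessment and Status Report on the {species_name}"
    return f"//span[contains(., {xpath_literal(title)})]"


print(report_xpath("Victoria’s Owl-clover"))
```

The returned string can be passed as the second element of the (By.XPATH, ...) tuple above. The page renders the name with a typographic apostrophe (’), so the name has to be spelled the way the page displays it, which may differ from the keyword typed into the search URL.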

score 1 · Answer 2 · answered Dec 10 '21 at 06:42

Try this:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time


chrome_options = Options()
#chrome_options.add_argument("--headless")
#chrome_options.add_argument("user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36")


driver = webdriver.Chrome(executable_path="./chromedriver", options=chrome_options)

driver.get("https://species-registry.canada.ca/index-en.html#/documents?documentTypeId=18&sortBy=documentTypeSort&sortDirection=asc&pageSize=10&keywords=Victoria%27s%20Owl-clover")
time.sleep(2)

driver.find_element_by_xpath("//a[@class='card-header']").click()

fam

Ah, the `find_element_by_xpath` line worked! Is that XPath (the stuff in the brackets) what you get from doing a "copy XPath" from the source HTML? Thanks for your help – Max Duso Dec 10 '21 at 17:48
You are most welcome! Yes, that is one way of finding xpaths (not a recommended way though). I manually design the xpaths by analyzing the html tags. – fam Dec 13 '21 at 04:57
Would you be able to explain, though, why my methods won't work? Mainly because I am now on the next page trying to find a similar element, but I don't seem to be as skilled as you at manually designing an XPath. – Max Duso Dec 15 '21 at 21:52
I am not saying your methods won't work. I am just saying that it is not the best way as sometimes they won't. Please have a look at this answer. It will give you an idea of what I am saying. https://stackoverflow.com/questions/43090530/why-xpath-derived-from-chrome-does-not-work – fam Dec 16 '21 at 05:24
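
To illustrate the point made in these comments, here is a small sketch comparing a position-based path (the style a browser's "copy full XPath" tends to produce) with a hand-written path keyed on the class attribute. The markup below is hypothetical and heavily simplified, not the registry's real HTML, and it uses the standard library's limited XPath support instead of Selenium so that it runs without a browser:

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: NOT the registry's real HTML.
PAGE_V1 = """
<html><body>
  <div><a class="menu" href="#/home">Home</a></div>
  <div><a class="card-header" href="#/documents/2077">Report</a></div>
</body></html>
"""

# The same page after an extra banner <div> is added above the results.
PAGE_V2 = """
<html><body>
  <div>Banner</div>
  <div><a class="menu" href="#/home">Home</a></div>
  <div><a class="card-header" href="#/documents/2077">Report</a></div>
</body></html>
"""

POSITIONAL = "./body/div[2]/a"               # depends on element order
BY_ATTRIBUTE = ".//a[@class='card-header']"  # depends only on the class


def href_for(page: str, path: str):
    """Return the href of the first element matching path, or None."""
    link = ET.fromstring(page.strip()).find(path)
    return None if link is None else link.get("href")


for name, page in (("v1", PAGE_V1), ("v2", PAGE_V2)):
    print(name, href_for(page, POSITIONAL), href_for(page, BY_ATTRIBUTE))
```

The positional path silently picks a different link once the layout shifts, while the attribute-based path keeps finding the card link. That is the general reason hand-designed paths tend to be more robust.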

score 1 · Answer 3 · answered Dec 10 '21 at 10:12

import requests
from pprint import pp
headers = {
    "api-key": "3A1E8E87503C069448999238ABD05EE9"
}

params = {
    'api-version': '2017-11-11'
}


def main(url):
    with requests.Session() as req:
        req.headers.update(headers)
        req.params = params
        data = {
            "count": 'true',
            "filter": "((documentTypeId eq 18))",
            "orderby": "documentTypeSort asc,sortDate desc,documentCreateDate asc,documentTitleSort asc",
            "queryType": "full",
            "search": "/.*Victoria's.*/ /.*Owl-clover.*/",
            "searchMode": "all",
            "select": "id,consultationEndDate,consultationStartDate,consultationActivationStatusId,documentCreateDate,documentDescription,documentTitle,documentTypeId,species,attachments,contacts,links,finalOrDelayed",
            "skip": 0,
            "top": 10
        }
        r = req.post(url, json=data)
        ndata = {
            'filter': f"id eq '{r.json()['value'][0]['id']}'"
        }
        r = req.post(url, json=ndata)
        pp(r.json())


main('https://ecprccsarsrch.search.windows.net/indexes/docblobidxen/docs/search')

Output:

{'@odata.context': "https://ecprccsarsrch.search.windows.net/indexes('docblobidxen')/$metadata#docs(*)",
 'value': [{'@search.score': 1.0,
            'id': '2077',
            'documentTitle': 'COSEWIC Assessment and Status Report on the '
                             'Victoria’s Owl-clover <em>Castilleja '
                             'victoriae</em> in Canada',
            'documentCreateDate': '2010-09-01T13:54:36.8Z',
            'documentDescription': 'Victoria’s Owl-clover (<em>Castilleja '
                                   'victoriae</em>) is a newly described '
                                   'species, previously misidentified as  '
                                   '(<em>C. ambigua</em> ssp. '
                                   '<em>ambigua</em>). It is a small herb of '
                                   'the broomrape family with alternate,  '
                                   'hairy, lobed stem leaves and no basal '
                                   'rosette. The wider and more deeply lobed  '
                                   'upper leaves grade into the floral bracts. '
                                   'The sepals are fused into a  five-lobed '
                                   'calyx, and the petals are fused into a '
                                   '2-lipped flower 10-18 mm  long. The lower '
                                   'lip is lemon-yellow with minute white tips '
                                   'on each of the three  lobes. The upper lip '
                                   'is slightly longer than the lower lip and '
                                   'creamy white. The  fruits are brown, '
                                   '2-celled capsules that split at the tip '
                                   'when the seeds are  ripe. Each capsule '
                                   'bears 30-70 brown seeds with a sculptured '
                                   'seed coat.',
            'documentTypeId': 18,
            'consultationStartDate': None,
            'consultationEndDate': None,
            'consultationActivationStatusId': 0,
            'finalOrDelayed': 6,
            'attachments': ['{"attachmentId":"8142","attachmentTitle":"COSEWIC '
                            'Assessment and Status Report on the Victoria’s '
                            'Owl-clover <em>Castilleja victoriae</em> in '
                            'Canada","attachmentPublicationDate":"2010-09-03T00:00:00","file":"/cosewic/sr_Victoria\'s '
                            'Owl-clover_0810_e.pdf","html":"https://www.canada.ca/en/environment-climate-change/services/species-risk-public-registry/cosewic-assessments-status-reports/victoria-owl-clover-2010.html"}'],
            'contacts': ['{"salutation":"None","title":"","id":33,"firstName":"","lastName":"","organization":"COSEWIC '
                         'Secretariat","address":"c/o Canadian Wildlife '
                         'Service\\r\\n Environment '
                         'Canada","postalCode":"K1A0H3","city":"Ottawa","province":"ON","phone":"8199384125","email":"cosewic-cosepac@ec.gc.ca","fax":"8199383984"}'],
            'links': [],
            'species': ['1084-749']}]}
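
Because the species name changes between searches, the "search" value and the resulting link can be derived from the name and the returned id. This is a hedged sketch: `build_search` and `document_url` are hypothetical helpers, the one-pattern-per-word format is inferred only from the single example above, and the link format comes from the href in the question. Characters with a special meaning in the pattern syntax are not escaped here.

```python
BASE_URL = "https://species-registry.canada.ca/index-en.html"


def build_search(keywords: str) -> str:
    """Wrap each whitespace-separated word as /.*word.*/ (format inferred)."""
    return " ".join(f"/.*{word}.*/" for word in keywords.split())


def document_url(doc_id: str) -> str:
    """Turn an id from the JSON response into the page link for that document."""
    return f"{BASE_URL}#/documents/{doc_id}"


print(build_search("Victoria's Owl-clover"))
print(document_url("2077"))
```

The first result can replace the hard-coded "search" entry in the data dict, and the second can be applied to r.json()['value'][0]['id'].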

Md. Fazlul Hoque · Answer 4 · 2021-12-10T17:44:11.513

I use Selenium with bs4. The URLs you want to grab are relative, and the script can also convert them into absolute URLs. To print the absolute URLs, uncomment the last two lines of the loop.

PS: You just need to install the driver manager (pip install webdriver-manager) and run the script.

Script:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import time


url = 'https://species-registry.canada.ca/index-en.html#/documents?sortBy=documentTypeSort&sortDirection=asc&currentPage=1&pageSize=10'

cm = ChromeDriverManager().install()
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(cm))

driver.maximize_window()
time.sleep(8)
driver.get(url)
time.sleep(5)

base_url = 'https://species-registry.canada.ca/index-en.html'
soup = BeautifulSoup(driver.page_source, 'html.parser')
hrefs=soup.find_all('a',class_='card-header')

for href in hrefs:
    relative_url= href['href']
    print(relative_url)
    #abs_url= base_url + href['href']
    #print(abs_url)

Output as relative URLs:

#/documents/2968
#/documents/3002
#/documents/1590
#/documents/3332
#/documents/3354
#/documents/3357
#/documents/1451
#/documents/3325
#/documents/3333
#/documents/205

Output as absolute urls:

https://species-registry.canada.ca/index-en.html#/documents/2968
https://species-registry.canada.ca/index-en.html#/documents/3002
https://species-registry.canada.ca/index-en.html#/documents/1590
https://species-registry.canada.ca/index-en.html#/documents/3332
https://species-registry.canada.ca/index-en.html#/documents/3354
https://species-registry.canada.ca/index-en.html#/documents/3357
https://species-registry.canada.ca/index-en.html#/documents/1451
https://species-registry.canada.ca/index-en.html#/documents/3325
https://species-registry.canada.ca/index-en.html#/documents/3333
https://species-registry.canada.ca/index-en.html#/documents/205
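
As an alternative to concatenating base_url by hand, the relative-to-absolute step can be sketched with the standard library only. `CardLinkParser`, `absolute_card_links` and the `SAMPLE` markup are hypothetical names and data for illustration. In practice the input would be driver.page_source, because the links only exist after the browser has rendered the page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

PAGE_URL = ("https://species-registry.canada.ca/index-en.html"
            "#/documents?sortBy=documentTypeSort&sortDirection=asc"
            "&currentPage=1&pageSize=10")


class CardLinkParser(HTMLParser):
    """Collect the href of every <a> whose class list contains card-header."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "a" and "card-header" in classes and attributes.get("href"):
            self.hrefs.append(attributes["href"])


def absolute_card_links(page_source: str, page_url: str = PAGE_URL) -> list:
    parser = CardLinkParser()
    parser.feed(page_source)
    # urljoin swaps the fragment of page_url for the one in each relative href
    return [urljoin(page_url, href) for href in parser.hrefs]


# Hypothetical stand-in for driver.page_source
SAMPLE = ('<a class="card-header" href="#/documents/2968">A</a>'
          '<a class="menu" href="#/home">B</a>'
          '<a class="card-header" href="#/documents/205">C</a>')

for link in absolute_card_links(SAMPLE):
    print(link)
```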
