Extracting URL from href attribute of specific tag using Python and Selenium

Question

I'm trying to extract a URL from the href attribute of a specific tag on a webpage using Python and Selenium. The tag's href attribute starts with "javascript:SetAzurePlayerFileName", and I want to extract the URL that follows this string.

Here's an example of the tag (from this page):

<a href="javascript:SetAzurePlayerFileName('https://video.knesset.gov.il/KnsVod/_definst_/mp4:CMT/CmtSession_2081117.mp4/manifest.mpd',...">Link text</a>

I want to extract the URL "https://video.knesset.gov.il/KnsVod/_definst_/mp4:CMT/CmtSession_2081117.mp4/manifest.mpd".

I've tried using BeautifulSoup with the requests library, but it seems the content of the webpage is loaded dynamically with JavaScript, so requests can't fetch the tag.

I then tried using Selenium with ChromeDriver, but I encountered issues with the WebDriver and the browser driver. I also tried setting the User-Agent header and waiting for the tag to be present with WebDriverWait, but I still couldn't find the element.

I switched to Firefox and GeckoDriver, but I'm still unable to find the element, and I'm getting a "NoSuchElementError".

Here's the latest version of my script:

from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from webdriver_manager.firefox import GeckoDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import re

def get_url_from_webpage(url):
    options = webdriver.FirefoxOptions()
    options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0')
    driver = webdriver.Firefox(service=Service(GeckoDriverManager().install()), options=options)
    driver.get(url)

    try:
        a_tag = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//a[starts-with(@href, "javascript:SetAzurePlayerFileName")]')))
        href = a_tag.get_attribute('href')
        # [^']* stops at the first closing quote; a greedy (.*) could run on to a later "',"
        match = re.search(r"javascript:SetAzurePlayerFileName\('([^']*)'", href)
        if match:
            return match.group(1)
    except Exception as e:
        print(e)
    finally:
        driver.quit()

    return None

Does anyone have any ideas on how I can successfully extract the URL from the href attribute of this tag?

Note: I believe the main issue is that the content doesn't seem to load when the page is driven from Python, rather than a problem with the regex or the XPath expression.

undetected Selenium · Answer 1 · 2023-06-29T15:35:59.400

Given the HTML:

<a href="javascript:SetAzurePlayerFileName('https://video.knesset.gov.il/KnsVod/_definst_/mp4:CMT/CmtSession_2081117.mp4/manifest.mpd',...">Link text</a>

To print the URL from the href attribute, wait for the element with WebDriverWait and visibility_of_element_located(), read the attribute with get_attribute(), and then split the value on the ' (single quote) character. You can use either of the following locator strategies:

Using LINK_TEXT and split('\''):

print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.LINK_TEXT, "Link text"))).get_attribute("href").split('\'')[1])

Using XPATH and split("'"):

print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//a[text()='Link text']"))).get_attribute("href").split("'")[1])

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

You can find a relevant discussion in Python Selenium - get href value

Proof of concept

html = '''
    <a href="javascript:SetAzurePlayerFileName('https://video.knesset.gov.il/KnsVod/_definst_/mp4:CMT/CmtSession_2081117.mp4/manifest.mpd',...">Link text</a>
'''

print(html.split('\'')[1]) # prints -> https://video.knesset.gov.il/KnsVod/_definst_/mp4:CMT/CmtSession_2081117.mp4/manifest.mpd
print(html.split("'")[1]) # prints -> https://video.knesset.gov.il/KnsVod/_definst_/mp4:CMT/CmtSession_2081117.mp4/manifest.mpd
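The split("'")[1] approach above works as long as the URL is the first single-quoted value in the string. A slightly more defensive variant, sketched here under the assumption that the href always has the shape shown in the question, anchors a regular expression on the function name instead. The helper name extract_player_url is hypothetical and not part of any library.

```python
import re

# Matches the first single-quoted argument of SetAzurePlayerFileName(...).
# [^']+ stops at the closing quote, so later arguments are never captured.
_PLAYER_CALL = re.compile(r"SetAzurePlayerFileName\(\s*'([^']+)'")


def extract_player_url(href):
    """Return the first argument of the SetAzurePlayerFileName call, or None."""
    if not href:
        return None
    match = _PLAYER_CALL.search(href)
    return match.group(1) if match else None


# The href from the question, with its trailing "..." left elided as in the original.
href = (
    "javascript:SetAzurePlayerFileName("
    "'https://video.knesset.gov.il/KnsVod/_definst_/mp4:CMT/CmtSession_2081117.mp4/manifest.mpd',..."
)
print(extract_player_url(href))
```

With Selenium, the same helper could be applied to the string returned by get_attribute("href"); it returns None rather than raising when the attribute is missing or has a different form.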

I received `TimeoutException: Message: Stacktrace: RemoteError@chrome://remote/content/shared/RemoteError.sys.mjs:8:8 WebDriverError@chrome://remote/content/shared/webdriver/Errors.sys.mjs:183:5 NoSuchElementError@chrome://remote/content/shared/webdriver/Errors.sys.mjs:395:5 element.find/<@chrome://remote/content/marionette/element.sys.mjs:134:16` — Yanirmr, Jun 29 '23 at 15:26
@Yanirmr I can't access the url, I simply used the HTML you provided. Checkout the answer update. — undetected Selenium, Jun 29 '23 at 15:36
Do you have any idea why the page doesn't load while using selenium? — Yanirmr, Jun 29 '23 at 15:47
@Yanirmr Check if the desired element is within an [**iframe**](https://stackoverflow.com/a/53276478/7429447) or within a [**shadowRoot**](https://stackoverflow.com/a/73242476/7429447) — undetected Selenium, Jun 29 '23 at 19:11
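
Following the iframe suggestion in the comment above, one way to check is to scan the page source of the top-level document and then of each direct iframe. This is only a sketch: find_player_url_in_frames is a hypothetical helper, it assumes the link markup shows up in driver.page_source for whichever frame contains it, it does not descend into nested iframes or shadow roots, and it has not been run against the actual page.

```python
import re

# First single-quoted argument of a SetAzurePlayerFileName(...) call.
PLAYER_CALL = re.compile(r"SetAzurePlayerFileName\(\s*'([^']+)'")


def find_player_url_in_frames(driver):
    """Search the top-level document, then each direct iframe, for the URL."""

    def search_current_context():
        # page_source reflects whichever document the driver is switched to.
        match = PLAYER_CALL.search(driver.page_source)
        return match.group(1) if match else None

    driver.switch_to.default_content()
    url = search_current_context()
    if url:
        return url

    # "tag name" is the string value of selenium's By.TAG_NAME.
    for frame in driver.find_elements("tag name", "iframe"):
        driver.switch_to.frame(frame)
        try:
            url = search_current_context()
            if url:
                return url
        finally:
            # Always return to the top-level document before the next frame.
            driver.switch_to.default_content()
    return None
```

It would be called as find_player_url_in_frames(driver) after driver.get(url) and after any wait for the page to finish loading. If it returns None for every frame, the link is probably created later by script or sits inside a shadow root, which this sketch does not cover.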
