
I'm trying to extract a URL from the href attribute of a specific tag on a webpage using Python and Selenium. The tag has a href attribute that starts with "javascript:SetAzurePlayerFileName", and I want to extract the URL that follows this string.

Here's an example of the tag (from this page):

<a href="javascript:SetAzurePlayerFileName('https://video.knesset.gov.il/KnsVod/_definst_/mp4:CMT/CmtSession_2081117.mp4/manifest.mpd',...">Link text</a>

I want to extract the URL "https://video.knesset.gov.il/KnsVod/_definst_/mp4:CMT/CmtSession_2081117.mp4/manifest.mpd".
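As a minimal sketch of the extraction step on its own, assuming the href always has the shape shown above, the first single-quoted argument can be pulled out with a regular expression. The name `extract_player_url` is made up for illustration, and the sample string keeps the same elided tail (`,...`) as the tag above.

```python
import re

# Matches the first single-quoted argument of SetAzurePlayerFileName(...).
# [^']* stops at the closing quote, so later arguments are not swallowed.
PLAYER_HREF = re.compile(r"javascript:SetAzurePlayerFileName\(\s*'([^']*)'")


def extract_player_url(href):
    """Return the URL passed to SetAzurePlayerFileName, or None if absent."""
    match = PLAYER_HREF.search(href or "")
    return match.group(1) if match else None


if __name__ == "__main__":
    # Same (truncated) href as in the example tag.
    sample = (
        "javascript:SetAzurePlayerFileName("
        "'https://video.knesset.gov.il/KnsVod/_definst_/mp4:CMT/"
        "CmtSession_2081117.mp4/manifest.mpd',..."
    )
    print(extract_player_url(sample))
```

Returning `None` for a missing or non-matching href keeps the caller from having to catch an exception when the wrong element is found.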

I've tried using BeautifulSoup with the requests library, but it seems the content of the webpage is loaded dynamically with JavaScript, so requests can't fetch the tag.
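One way to confirm this is to check whether the tag appears in the raw HTML at all. The sketch below uses only the standard library; `find_player_hrefs` is a hypothetical helper that takes HTML text (for example, the `.text` of a `requests` response) and returns any matching href values. An empty list would suggest the link is injected later by JavaScript.

```python
from html.parser import HTMLParser

PREFIX = "javascript:SetAzurePlayerFileName"


class _PlayerLinkParser(HTMLParser):
    """Collects href values of <a> tags that call SetAzurePlayerFileName."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; value can be None.
        href = dict(attrs).get("href") or ""
        if href.startswith(PREFIX):
            self.hrefs.append(href)


def find_player_hrefs(html_text):
    """Return all matching href strings found in html_text."""
    parser = _PlayerLinkParser()
    parser.feed(html_text)
    return parser.hrefs
```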

I then tried using Selenium with ChromeDriver, but I encountered issues with the WebDriver and the browser driver. I also tried setting the User-Agent header and waiting for the tag to be present with WebDriverWait, but I still couldn't find the element.

I switched to Firefox and GeckoDriver, but I'm still unable to find the element, and I'm getting a "NoSuchElementError".

Here's the latest version of my script:

from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from webdriver_manager.firefox import GeckoDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import re

def get_url_from_webpage(url):
    options = webdriver.FirefoxOptions()
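    # 'user-agent=...' is a Chrome-style argument; Firefox takes the override as a preference.
    options.set_preference('general.useragent.override', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0')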
    driver = webdriver.Firefox(service=Service(GeckoDriverManager().install()), options=options)
    driver.get(url)

    try:
        a_tag = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//a[starts-with(@href, "javascript:SetAzurePlayerFileName")]')))
        href = a_tag.get_attribute('href')
        match = re.search(r"javascript:SetAzurePlayerFileName\('([^']*)'", href)
        if match:
            return match.group(1)
    except Exception as e:
        print(e)
    finally:
        driver.quit()

    return None

Does anyone have any ideas on how I can successfully extract the URL from the href attribute of this tag?

Note: I believe the main issue is that the content isn't loaded when the page is opened this way from Python, rather than a problem with the regex or the XPath expression.

baduker
Yanirmr

1 Answer


Given the HTML:

<a href="javascript:SetAzurePlayerFileName('https://video.knesset.gov.il/KnsVod/_definst_/mp4:CMT/CmtSession_2081117.mp4/manifest.mpd',...">Link text</a>

To print the value of the href attribute, wait for the element with WebDriverWait and visibility_of_element_located(), read the attribute with get_attribute(), and then split() the result on the ' character. You can use either of the following locator strategies:

  • Using LINK_TEXT and split('\''):

    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.LINK_TEXT, "Link text"))).get_attribute("href").split('\'')[1])
    
  • Using XPATH and split("'"):

    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//a[text()='Link text']"))).get_attribute("href").split("'")[1])
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    

You can find a relevant discussion in Python Selenium - get href value


Proof of concept

html = '''
    <a href="javascript:SetAzurePlayerFileName('https://video.knesset.gov.il/KnsVod/_definst_/mp4:CMT/CmtSession_2081117.mp4/manifest.mpd',...">Link text</a>
'''

print(html.split('\'')[1]) # prints -> https://video.knesset.gov.il/KnsVod/_definst_/mp4:CMT/CmtSession_2081117.mp4/manifest.mpd
print(html.split("'")[1]) # prints -> https://video.knesset.gov.il/KnsVod/_definst_/mp4:CMT/CmtSession_2081117.mp4/manifest.mpd
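Note that `split("'")[1]` raises `IndexError` when the string contains no single quote, for instance if a different element was matched. A slightly more defensive sketch returns `None` instead; the helper name `first_quoted` is invented here and is not part of Selenium.

```python
def first_quoted(text):
    """Return the text between the first pair of single quotes, or None."""
    _, opening, rest = text.partition("'")
    if not opening:
        return None  # no opening quote at all
    value, closing, _ = rest.partition("'")
    return value if closing else None  # None when the quote is never closed
```

Unlike the plain `split`, this also returns `None` when the opening quote is never closed, which may or may not be what you want for a truncated href.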
undetected Selenium
  • I received `TimeoutException: Message: Stacktrace: RemoteError@chrome://remote/content/shared/RemoteError.sys.mjs:8:8 WebDriverError@chrome://remote/content/shared/webdriver/Errors.sys.mjs:183:5 NoSuchElementError@chrome://remote/content/shared/webdriver/Errors.sys.mjs:395:5 element.find/<@chrome://remote/content/marionette/element.sys.mjs:134:16` – Yanirmr Jun 29 '23 at 15:26
  • @Yanirmr I can't access the URL; I simply used the HTML you provided. Check out the answer update. – undetected Selenium Jun 29 '23 at 15:36
  • Do you have any idea why the page doesn't load while using selenium? – Yanirmr Jun 29 '23 at 15:47
  • @Yanirmr Check if the desired element is within an [**iframe**](https://stackoverflow.com/a/53276478/7429447) or within a [**shadowRoot**](https://stackoverflow.com/a/73242476/7429447) – undetected Selenium Jun 29 '23 at 19:11
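
A rough sketch of the iframe check suggested in the last comment, assuming the link sits at most one frame deep (nested frames and shadow roots are not handled). `find_in_frames` is a hypothetical helper. It relies only on the driver's `find_elements`, `switch_to.frame`, and `switch_to.default_content`, and passes the locator strategies as the plain strings that `By.XPATH` and `By.TAG_NAME` stand for, so the snippet needs no imports.

```python
PLAYER_XPATH = '//a[starts-with(@href, "javascript:SetAzurePlayerFileName")]'


def find_in_frames(driver, xpath=PLAYER_XPATH):
    """Return the href of the first match in the top document or in any
    top-level iframe, or None if nothing matches."""
    driver.switch_to.default_content()
    matches = driver.find_elements("xpath", xpath)  # same as By.XPATH
    if matches:
        return matches[0].get_attribute("href")

    frames = driver.find_elements("tag name", "iframe")  # same as By.TAG_NAME
    for frame in frames:
        driver.switch_to.frame(frame)
        try:
            matches = driver.find_elements("xpath", xpath)
            if matches:
                return matches[0].get_attribute("href")
        finally:
            # Always return to the top document before trying the next frame.
            driver.switch_to.default_content()
    return None
```

Because `find_elements` returns an empty list rather than raising, this would normally be called after an explicit wait for the page (or the frames) to finish loading.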