I'm trying to extract a URL from the href attribute of a specific tag on a webpage using Python and Selenium. The tag has an href attribute that starts with "javascript:SetAzurePlayerFileName", and I want to extract the URL passed as the first argument of that call.
Here's an example of the tag (from this page):
<a href="javascript:SetAzurePlayerFileName('https://video.knesset.gov.il/KnsVod/_definst_/mp4:CMT/CmtSession_2081117.mp4/manifest.mpd',...">Link text</a>
I want to extract the URL "https://video.knesset.gov.il/KnsVod/_definst_/mp4:CMT/CmtSession_2081117.mp4/manifest.mpd".
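For the extraction step on its own, here is a minimal sketch using only the standard `re` module. The function name `extract_player_url` is mine, and the sample string is the truncated example from above, so the only assumption is that the URL is the first single-quoted argument of the call.

```python
import re

# Matches the first single-quoted argument of SetAzurePlayerFileName(...).
# [^']* stops at the closing quote, so later arguments are never captured.
PLAYER_CALL = re.compile(r"SetAzurePlayerFileName\(\s*'([^']*)'")


def extract_player_url(href):
    """Return the first quoted argument of the call, or None if absent."""
    match = PLAYER_CALL.search(href)
    return match.group(1) if match else None


# The example href from above, kept truncated exactly as shown.
sample_href = (
    "javascript:SetAzurePlayerFileName("
    "'https://video.knesset.gov.il/KnsVod/_definst_/mp4:CMT/"
    "CmtSession_2081117.mp4/manifest.mpd',..."
)
print(extract_player_url(sample_href))
```

This part works on a plain string, so it can be checked without a browser once the href has been obtained.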
I've tried using BeautifulSoup with the requests library, but it seems the content of the webpage is loaded dynamically with JavaScript, so requests can't fetch the tag.
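To make the "requests can't see the tag" observation reproducible, here is a rough sketch of the check, written with the standard library's `html.parser` so it has no extra dependencies. The names `PlayerLinkFinder` and `find_player_hrefs` are hypothetical helpers, not part of any library.

```python
from html.parser import HTMLParser

PREFIX = "javascript:SetAzurePlayerFileName"


class PlayerLinkFinder(HTMLParser):
    """Collects href values of <a> tags that start with PREFIX."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.startswith(PREFIX):
                self.hrefs.append(value)


def find_player_hrefs(html):
    """Return every matching href found in a static HTML string."""
    finder = PlayerLinkFinder()
    finder.feed(html)
    return finder.hrefs
```

Feeding it the text returned by requests (for example `find_player_hrefs(requests.get(url).text)`) gives an empty list when the link is absent from the static HTML, which would be consistent with it being added later by JavaScript.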
I then tried using Selenium with ChromeDriver, but I encountered issues with the WebDriver and the browser driver. I also tried setting the User-Agent header and waiting for the tag to be present with WebDriverWait, but I still couldn't find the element.
I switched to Firefox and GeckoDriver, but I'm still unable to find the element, and I'm getting a "NoSuchElementError".
Here's the latest version of my script:
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from webdriver_manager.firefox import GeckoDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import re

def get_url_from_webpage(url):
    options = webdriver.FirefoxOptions()
    # Firefox takes the user agent as a preference, not as a command-line argument
    options.set_preference('general.useragent.override', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0')
    driver = webdriver.Firefox(service=Service(GeckoDriverManager().install()), options=options)
    try:
        driver.get(url)
        # Wait up to 10 seconds for the link to appear in the DOM
        a_tag = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//a[starts-with(@href, "javascript:SetAzurePlayerFileName")]')))
        href = a_tag.get_attribute('href')
        # Capture only the first quoted argument; [^']* stops at its closing quote
        match = re.search(r"javascript:SetAzurePlayerFileName\('([^']*)'", href)
        if match:
            return match.group(1)
    except Exception as e:
        print(e)
    finally:
        # Always close the browser, even if the lookup failed
        driver.quit()
    return None
Does anyone have any ideas on how I can successfully extract the URL from the href attribute of this tag?
Note: I believe the main issue is that the content isn't being loaded when the page is opened from Python, rather than a problem with the regex or the XPath expression.
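One thing I haven't ruled out is that the link lives inside an iframe, in which case an XPath search from the top-level document would never find it. I have not confirmed that this page uses iframes, so the following is only a sketch of how that could be checked. The helper name `find_in_any_frame` is hypothetical; it uses the locator strings "xpath" and "tag name", which are the values of `By.XPATH` and `By.TAG_NAME`, so it needs no imports.

```python
def find_in_any_frame(driver, xpath):
    """Look for xpath in the top document, then in each top-level iframe.

    On success the driver is left switched into the frame that matched,
    so the returned element stays usable. Nested iframes are not searched.
    """
    driver.switch_to.default_content()
    found = driver.find_elements("xpath", xpath)  # same as By.XPATH
    if found:
        return found[0]
    for frame in driver.find_elements("tag name", "iframe"):  # same as By.TAG_NAME
        driver.switch_to.frame(frame)
        found = driver.find_elements("xpath", xpath)
        if found:
            return found[0]
        # Nothing here: go back to the top document before trying the next frame.
        driver.switch_to.default_content()
    return None
```

It would be called after `driver.get(url)` with the same XPath as in my script. A `None` result from every frame would support my guess that the content is simply never loaded.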