Selenium xpath: Trying to get original url from archived link

Question

I am working on a project that scrapes articles from archive websites. Below is an example of an archive url and its original url. I have the archive url, and I want to use Selenium to extract the original url.

Archive url: https://archive.is/xXAoL

Original url: https://beforeitsnews.com/eu/2021/08/breaking-germany-halts-all-covid-19-vaccines-says-they-are-unsafe-and-no-longer-recommended-2676130.html?fbclid=IwAR3JPcxNHlZ5eQHLyO2teh6_xcrerisBrCNeleOZz7qmxI7_pDJDBlEAIjU

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

url = "https://archive.is/xXAoL"
driver = webdriver.Chrome('./chromedriver')
driver.get(url)

Any advice on how to get the original url?

Method 1

One thing that might work: the page's canonical link is

https://archive.is/2021.09.07-145059/https://beforeitsnews.com/eu/2021/08/breaking-germany-halts-all-covid-19-vaccines-says-they-are-unsafe-and-no-longer-recommended-2676130.html?fbclid=IwAR3JPcxNHlZ5eQHLyO2teh6_xcrerisBrCNeleOZz7qmxI7_pDJDBlEAIjU

I could just strip out everything up to the second https. However, that approach is not working for me, so I am looking for another method that does not rely on the meta tags.
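For reference, the stripping idea in Method 1 can be sketched in plain Python. This is only an illustration, not something from the original post: the helper name `original_from_canonical` is hypothetical, and it assumes the canonical link always has the shape `https://archive.is/<timestamp>/<original url>` shown above. It does not address how the canonical link is obtained from the page in the first place.

```python
def original_from_canonical(canonical_url):
    """Return the original url embedded in an archive canonical link.

    Hypothetical helper. Assumes the canonical link looks like
    https://archive.is/<timestamp>/<original url>; returns None when
    no second "http" is present (for example, a short archive link).
    """
    # Start searching at index 1 so the archive url's own scheme is skipped.
    start = canonical_url.find("http", 1)
    return canonical_url[start:] if start != -1 else None


canonical = (
    "https://archive.is/2021.09.07-145059/"
    "https://beforeitsnews.com/eu/2021/08/"
    "breaking-germany-halts-all-covid-19-vaccines-says-they-are-unsafe-"
    "and-no-longer-recommended-2676130.html"
    "?fbclid=IwAR3JPcxNHlZ5eQHLyO2teh6_xcrerisBrCNeleOZz7qmxI7_pDJDBlEAIjU"
)
print(original_from_canonical(canonical))
```

Searching for "http" rather than "https" also covers archived pages whose original url used plain http.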

undetected Selenium · Accepted Answer · 2022-01-24T22:19:59.960

To extract the original url, use WebDriverWait with visibility_of_element_located() to wait until the input element named q is visible, then read its value attribute, which holds the original url. You can use either of the following locator strategies:

Using CSS_SELECTOR:

driver.get('https://archive.is/xXAoL')
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "input[name='q'][value]"))).get_attribute("value"))

Using XPATH:

driver.get('https://archive.is/xXAoL')
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//input[@name='q'][@value]"))).get_attribute("value"))

Console Output:

https://beforeitsnews.com/eu/2021/08/breaking-germany-halts-all-covid-19-vaccines-says-they-are-unsafe-and-no-longer-recommended-2676130.html?fbclid=IwAR3JPcxNHlZ5eQHLyO2teh6_xcrerisBrCNeleOZz7qmxI7_pDJDBlEAIjU

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

edited Jan 24 '22 at 22:19

answered Jan 24 '22 at 22:08

undetected Selenium


I am getting `NameError: name 'EC' is not defined`...what is EC supposed to be? – asd Jan 24 '22 at 22:11
Okay the imports fixed that. I still get "NameError: name 'by_CSS_SELECTOR' is not defined" – asd Jan 24 '22 at 22:15
It should be `By.CSS_SELECTOR` – undetected Selenium Jan 24 '22 at 22:17
Using xpath and is working well, scraped about a hundred so far – asd Jan 24 '22 at 23:07
