I am working on a project that scrapes articles from archive websites. For example, below are an archive url and the original url it was saved from. I only have the archive url, and I want to use Selenium to extract the original url.
Archive url: https://archive.is/xXAoL
Original url: https://beforeitsnews.com/eu/2021/08/breaking-germany-halts-all-covid-19-vaccines-says-they-are-unsafe-and-no-longer-recommended-2676130.html?fbclid=IwAR3JPcxNHlZ5eQHLyO2teh6_xcrerisBrCNeleOZz7qmxI7_pDJDBlEAIjU
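from selenium import webdriver
from selenium.webdriver.chrome.service import Service

url = "https://archive.is/xXAoL"
# Selenium 4 takes the chromedriver path through a Service object;
# on Selenium 3, webdriver.Chrome('./chromedriver') works instead.
driver = webdriver.Chrome(service=Service('./chromedriver'))
driver.get(url)  # load the archived page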
Any advice on how to get the original url?
Method 1
One thing that might work is to use the page's canonical link, which is
https://archive.is/2021.09.07-145059/https://beforeitsnews.com/eu/2021/08/breaking-germany-halts-all-covid-19-vaccines-says-they-are-unsafe-and-no-longer-recommended-2676130.html?fbclid=IwAR3JPcxNHlZ5eQHLyO2teh6_xcrerisBrCNeleOZz7qmxI7_pDJDBlEAIjU
I could strip out everything before the second https to get the original url. However, that method is not working for me, so I am looking for another method that does not rely on the meta tags.
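For reference, here is a minimal sketch of the stripping step in Python. It assumes the canonical link always has the form "archive prefix + full original url", which I have not verified for every archive page. The function name extract_original_url and the example.com url are hypothetical, used only for illustration.

```python
import re
from typing import Optional

# Matches the start of an absolute http(s) url.
_SCHEME = re.compile(r"https?://")


def extract_original_url(archive_url: str) -> Optional[str]:
    """Return everything from the second 'http(s)://' onward, or None.

    Assumption: the original url is embedded verbatim after the
    archive prefix, as in the canonical link quoted above.
    """
    matches = list(_SCHEME.finditer(archive_url))
    if len(matches) < 2:
        # Short links such as https://archive.is/xXAoL carry no embedded url.
        return None
    return archive_url[matches[1].start():]


if __name__ == "__main__":
    # Hypothetical canonical link with the same shape as the one above.
    canonical = "https://archive.is/2021.09.07-145059/https://example.com/a.html?x=1"
    print(extract_original_url(canonical))
```

With Selenium, the canonical value could probably be read with driver.find_element(By.CSS_SELECTOR, "link[rel='canonical']").get_attribute("href") and passed to this function. That assumes the page exposes it as a link element, which is exactly the meta dependency I would like to avoid.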