I'm trying to use the Python library newspaper with the Wayback Machine, which stores archived snapshots of websites. In theory, old news articles could be queried and downloaded from these archives.
For instance, the following code queries the CNBC archive for a specific capture date.
import newspaper
# Wayback Machine snapshot of cnbc.com captured on 2016-12-01
url = 'http://web.archive.org/web/20161201123529/http://www.cnbc.com/'
paper = newspaper.build(url, memoize_articles=False)
Although the archived page itself contains links to actual news articles from 2016-12-01, the newspaper module does not seem to pick them up. Instead, you get URLs such as:
https://blog.archive.org/2016/10/23/defining-web-pages-web-sites-and-web-captures/
which are not actual articles from this archived version of CNBC. However, newspaper works great with today's version of CNBC.
I suppose it gets confused by the format of the URL, which contains http:// twice. Does anyone have any suggestions on how to extract articles from the Wayback Machine archives?
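To make the URL format I mean concrete, here is a minimal sketch that splits a snapshot URL into its capture timestamp and the original site URL. The helper name split_wayback_url and the regular expression are my own illustration, not part of newspaper, and they assume the usual /web/<timestamp>/<original URL> layout.

```python
import re

# Hypothetical helper (not part of newspaper). Assumes snapshot URLs look like
# http://web.archive.org/web/<digits>[modifier]/<original URL>.
WAYBACK_RE = re.compile(r'^https?://web\.archive\.org/web/(\d{4,14})[a-z_]*/(.+)$')

def split_wayback_url(url):
    """Return (timestamp, original_url), or None if url is not a snapshot URL."""
    match = WAYBACK_RE.match(url)
    if match is None:
        return None
    return match.group(1), match.group(2)

print(split_wayback_url('http://web.archive.org/web/20161201123529/http://www.cnbc.com/'))
# → ('20161201123529', 'http://www.cnbc.com/')
```

The blog.archive.org link above does not match this pattern, so a check like this might at least separate archived article links from the Wayback Machine's own navigation links.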