How would I go about reversing the process of Google's AMP API?

I am looking to take an AMP (Accelerated Mobile Pages) URL and recover the regular (original) URL. Does anyone know how to do this in Python (or any other language, for that matter)? Any help would be greatly appreciated.

An example:

https://amp.cnn.com/cnn/2018/03/08/politics/jeff-flake-anti-tariff-bill/
Expected output:
https://cnn.com/2018/03/08/politics/jeff-flake-anti-tariff-bill/

A second example:

https://www.google.ca/amp/s/mobile.nytimes.com/2018/03/08/us/politics/trump-tariff-announcement.amp.html
Expected output:
https://www.nytimes.com/2018/03/08/us/politics/trump-tariff-announcement.html

A third (and final) example:

https://www.google.ca/amp/s/www.theverge.com/platform/amp/2018/3/8/17097904/android-ios-smartphone-brand-loyalty
Expected output:
https://www.theverge.com/2018/3/8/17097904/android-ios-smartphone-brand-loyalty

Unfortunately, the way sites build their AMP URLs appears to vary considerably. One approach could be to just chop out any "amp" along with its surrounding dots (.) or slashes (/). However, I can imagine scenarios where that would not be wise, mainly when "amp" is a legitimate part of the regular page URL (for example, at the end of its path) and also appears in regular browsing.
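To make the concern concrete, here is a minimal sketch of that string-chopping heuristic. The function name `naive_strip_amp` and the `example.com` URL are made up for illustration, and the three patterns are only one guess at what "chop out any amp" might mean; this is not a recommended solution.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def naive_strip_amp(url):
    """Hypothetical heuristic: remove "amp" tokens from the host and path."""
    parts = urlsplit(url)
    # Drop a leading "amp." subdomain, e.g. amp.cnn.com -> cnn.com
    host = re.sub(r"^amp\.", "", parts.netloc)
    # Drop whole "amp" path segments, e.g. /platform/amp/2018 -> /platform/2018
    path = re.sub(r"/amp(?=/|$)", "", parts.path)
    # Drop ".amp" inside file names, e.g. story.amp.html -> story.html
    path = re.sub(r"\.amp(?=\.|$)", "", path)
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


if __name__ == "__main__":
    for candidate in (
        "https://amp.cnn.com/cnn/2018/03/08/politics/jeff-flake-anti-tariff-bill/",
        "https://www.theverge.com/platform/amp/2018/3/8/17097904/android-ios-smartphone-brand-loyalty",
        "https://example.com/guitar/amp/reviews",  # made-up non-AMP URL
    ):
        print(naive_strip_amp(candidate))
```

On the CNN example the extra "/cnn" segment survives, on the Verge example "/platform" survives, and the made-up guitar URL loses a legitimate "amp" segment, so none of the three matches the expected original URL.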

Markyroson

3 Answers

2

AMP pages are required to reference their canonical version via:

<link rel="canonical" href="https://www.example.com/url/to/full/document.html">

The correct way to discover the non-AMP version of a page is to fetch the AMP document and extract the href value of its canonical link tag.

You can read more about this in the official documentation.
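As a rough illustration of the extraction step, here is a sketch that uses only the Python standard library. The names `CanonicalLinkParser` and `find_canonical` are made up for this example, and it works on an HTML string you have already downloaded; fetching the page and handling malformed markup are left out.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalLinkParser(HTMLParser):
    """Hypothetical helper: records the href of the first rel="canonical" link."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        # Only <link> tags matter, and only the first canonical one is kept.
        if tag != "link" or self.canonical is not None:
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated values; compare case-insensitively.
        rel_values = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attr_map.get("href")


def find_canonical(html: str) -> Optional[str]:
    """Return the canonical URL declared in the HTML, or None if there is none."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonical


if __name__ == "__main__":
    sample = (
        '<html><head><link rel="canonical" '
        'href="https://www.example.com/url/to/full/document.html"></head></html>'
    )
    print(find_canonical(sample))
```

For the sample markup this prints the href from the link tag shown above; a page without a canonical link yields None.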

Sebastian Benz
1

For Python 3, another option is to open the URL and read the final URL from the response. Note that this only helps when the AMP URL redirects to the original page. Following @jadelord's answer on another question:

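import urllib.request

def resolve(url):
    # Follow any HTTP redirects and return the URL of the final response.
    with urllib.request.urlopen(url) as response:
        return response.geturl()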
0

For anyone who runs into this in the future, I thought I would share my solution. Using the information from @daKmoR, I was able to eventually come up with the following:

import metadata_parser

# Fetch the AMP page and parse its metadata.
page = metadata_parser.MetadataParser(url="https://amp.cnn.com/cnn/2018/03/08/politics/jeff-flake-anti-tariff-bill/")
# page = metadata_parser.MetadataParser(url="https://www.google.ca/amp/s/www.theverge.com/platform/amp/2018/3/8/17097904/android-ios-smartphone-brand-loyalty/")
# print(page.metadata)
# TODO: Doesn't work for The Verge
print("New")
# Look up the page's declared "url" metadata link.
real_URL = page.get_metadata_link('url')
if real_URL:
    print(real_URL)
else:
    print("Boo")

If you run into errors like "TLSV1_ALERT_PROTOCOL_VERSION", then you are probably running an outdated Python version, most likely one whose SSL library does not support the TLS version the server requires. "metadata_parser" referenced above is available on GitHub.

EDIT: Here is the updated code per @sebastian-benz's response.

import metadata_parser

# Fetch the AMP page and parse its metadata.
# page = metadata_parser.MetadataParser(url="https://amp.cnn.com/cnn/2018/03/08/politics/jeff-flake-anti-tariff-bill/")
page = metadata_parser.MetadataParser(url="https://www.google.ca/amp/s/mobile.nytimes.com/2018/03/08/us/politics/trump-tariff-announcement.amp.html")
# print(page.metadata)
# TODO: Doesn't work for The Verge
print("New")
# real_URL = page.get_metadata_link('url')
# Read the URL from the page's canonical link instead.
real_URL = page.get_url_canonical()
if real_URL:
    print(real_URL)
else:
    print("Boo")
Markyroson