23

I have been using Amazon's Product Advertising API to generate URLs that contain prices for a given book. One URL that I have generated is the following:

http://www.amazon.com/gp/offer-listing/0415376327%3FSubscriptionId%3DAKIAJZY2VTI5JQ66K7QQ%26tag%3Damaztest04-20%26linkCode%3Dxm2%26camp%3D2025%26creative%3D386001%26creativeASIN%3D0415376327

When I click on the link or paste the link on the address bar, the web page loads fine. However, when I execute the following code I get an error:

import urllib2

url = "http://www.amazon.com/gp/offer-listing/0415376327%3FSubscriptionId%3DAKIAJZY2VTI5JQ66K7QQ%26tag%3Damaztest04-20%26linkCode%3Dxm2%26camp%3D2025%26creative%3D386001%26creativeASIN%3D0415376327"
html_contents = urllib2.urlopen(url)

The error is urllib2.HTTPError: HTTP Error 503: Service Unavailable. First of all, I don't understand why I even get this error, since the web page loads fine in a browser.

Another odd behavior I have noticed is that the following code sometimes raises the stated error and sometimes does not:

html_contents = urllib2.urlopen("http://www.amazon.com/gp/offer-listing/0415376327%3FSubscriptionId%3DAKIAJZY2VTI5JQ66K7QQ%26tag%3Damaztest04-20%26linkCode%3Dxm2%26camp%3D2025%26creative%3D386001%26creativeASIN%3D0415376327")

I am totally lost on how this behavior occurs. Is there any fix or workaround for this? My goal is to read the HTML contents of the URL.

EDIT

I don't know why Stack Overflow is rewriting the Amazon link in my code above to rads.stackoverflow. Anyway, ignore the rads.stackoverflow link and use my link above between the quotes.

ruthless
  • 1,090
  • 4
  • 16
  • 36
  • If I'm not mistaken, `rads.stackoverflow.com` is (or was) an advertising service the SO implemented and then scrapped. It may very well be that there is some sort of use limitation (referrer, client and what not) – Germano Sep 19 '14 at 14:27
  • For some random reason, I don't know why the link changes to contain the stack overflow tag. However, if I keep the copy and paste link on the address bar, the website works fine. – ruthless Sep 19 '14 at 14:31
  • Ah I see! Nice :) This must be SO comment parser. – Germano Sep 19 '14 at 14:33

2 Answers

27

Amazon is rejecting the default User-Agent sent by urllib2. One workaround is to use the requests module:

import requests
page = requests.get("http://www.amazon.com/gp/offer-listing/0415376327%3FSubscriptionId%3DAKIAJZY2VTI5JQ66K7QQ%26tag%3Damaztest04-20%26linkCode%3Dxm2%26camp%3D2025%26creative%3D386001%26creativeASIN%3D0415376327")
html_contents = page.text
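Note that requests also sends its own default User-Agent (`python-requests/x.y`); if a site rejects that too, a browser-like header can be passed in explicitly. A minimal sketch, using a prepared request so nothing is actually sent over the network here:

```python
import requests

# requests sends "python-requests/x.y" as its default User-Agent; if a site
# rejects that as well, a browser-like value can be supplied explicitly.
url = "http://www.amazon.com/gp/offer-listing/0415376327"
req = requests.Request("GET", url, headers={"User-Agent": "Mozilla/5.0"}).prepare()
# requests.Session().send(req) would perform the actual fetch
```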

If you insist on using urllib2, this is how the User-Agent header can be faked:

import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
response = opener.open('http://www.amazon.com/gp/offer-listing/0415376327%3FSubscriptionId%3DAKIAJZY2VTI5JQ66K7QQ%26tag%3Damaztest04-20%26linkCode%3Dxm2%26camp%3D2025%26creative%3D386001%26creativeASIN%3D0415376327')
html_contents = response.read()
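For anyone on Python 3, where urllib2 was split into `urllib.request` and `urllib.error`, an equivalent sketch (the `opener.open` call is left commented out because it performs a live request):

```python
# Python 3 equivalent of the urllib2 workaround above.
import urllib.request

url = "http://www.amazon.com/gp/offer-listing/0415376327"
opener = urllib.request.build_opener()
opener.addheaders = [("User-agent", "Mozilla/5.0")]
# response = opener.open(url)
# html_contents = response.read().decode("utf-8")
```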

Don't worry about Stack Overflow editing the URL. They explain why they do this here.

Spade
  • 2,220
  • 1
  • 19
  • 29
  • For some weird reason, the link changes to contain the stack overflow tag. However, if you copy and paste the link on the address bar, everything works fine. Can you update your answer using my following link to see if it works because it doesn't work for me? – ruthless Sep 19 '14 at 14:37
  • Stack Overflow is compressing longer links or probably any external link to display content in a cleaner fashion. This might also be out of security vulnerabilities of pasting actual links into what can be formatted as code on the interface. In your real code, put whatever link you like and everything should work fine. – Spade Sep 19 '14 at 15:12
  • python 3 version? – Jonathan Lam Mar 16 '19 at 00:14

14

It's because Amazon doesn't allow automated access to its data: your request is rejected because it didn't come from a real browser. If you look at the content of the 503 response, it says:

To discuss automated access to Amazon data please contact api-services-support@amazon.com. For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com/ref=rm_5_sv, or our Product Advertising API at https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html/ref=rm_5_ac for advertising use cases.

This is because the User-Agent for Python's urllib is so obviously not a browser. You could always fake the User-Agent, but that's not really good (or moral) practice.
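That body can be read from the `HTTPError` exception itself, since it doubles as a file-like response object. A minimal sketch, constructing the error locally (no network) with an abridged version of the message quoted above; in Python 2 the same class is `urllib2.HTTPError`:

```python
import io
from urllib.error import HTTPError  # urllib2.HTTPError in Python 2

# HTTPError is both an exception and a response: it carries the status
# code, headers, and a readable body explaining the rejection.
err = HTTPError(
    url="http://www.amazon.com/gp/offer-listing/0415376327",
    code=503,
    msg="Service Unavailable",
    hdrs={},
    fp=io.BytesIO(b"To discuss automated access to Amazon data please "
                  b"contact api-services-support@amazon.com."),
)
print(err.code)    # 503
body = err.read()  # the explanatory text from the error page
print(body)
```

In real code, the same attributes are available in an `except urllib.error.HTTPError as err:` block around `urlopen`.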

As a side note, as mentioned in another answer, the requests library is really good for HTTP access in Python.

Ben
  • 6,687
  • 2
  • 33
  • 46
  • I was looking into your statement about using a User-Agent and was wondering if I needed to do something along the lines of adding headers like this for urllib2: http://stackoverflow.com/questions/802134/changing-user-agent-on-urllib2-urlopen – ruthless Sep 19 '14 at 15:20
  • 1
    Yes, that's how you change the User-Agent. Again, the `requests` library [here](http://docs.python-requests.org/en/latest/) is much better for this. – Ben Sep 19 '14 at 15:58
  • 3
    There is nothing immoral about faking the user agent (or any other private data)!!! These companies (or anyone) don't have a god-given right to know what browser you use, or what device you own or what websites you visit. The more people lie about their data, the better. They can and will use that information against you. They don't hesitate a second to gain advantage over you, so shouldn't you! – uzumaki May 04 '21 at 15:45