
I am new to Python and am trying to learn some new modules. Fortunately or unfortunately, I picked up the urllib2 module and started using it with one URL that's causing me problems.

To begin with, I created the Request object and then called read() on the response object. It was failing. Turns out it's getting redirected, but the status code is still 200. Not sure what's going on. Here is the code --

import urllib2

def get_url_data(url):
    print "Getting URL " + url
    user_agent = "Mozilla/5.0 (Windows NT 6.0; rv:14.0) Gecko/20100101 Firefox/14.0.1"
    headers = { 'User-Agent' : user_agent }
    # pass the headers as headers, not as POST data
    request = urllib2.Request(url, headers=headers)

    try:
        response = urllib2.urlopen(request)
    except urllib2.HTTPError, e:
        # the HTTPError object exposes the same accessors as a response
        print e.geturl()
        print e.info()
        print e.getcode()
        return False
    else:
        print response
        print response.info()
        print response.getcode()
        print response.geturl()
        return response

I am calling the above function with http://www.chilis.com.

I was expecting to receive a 301, 302, or 303 but instead I see 200. Here is the response I see --

Getting URL http://www.chilis.com
<addinfourl at 4354349896 whose fp = <socket._fileobject object at 0x1037513d0>>
Cache-Control: private
Server: Microsoft-IIS/7.5
SPRequestGuid: 48bbff39-f8b1-46ee-a70c-bcad16725a4d
X-SharePointHealthScore: 0
X-AspNet-Version: 2.0.50727
X-Powered-By: ASP.NET
MicrosoftSharePointTeamServices: 14.0.0.6120
X-MS-InvokeApp: 1; RequireReadOnly
Date: Wed, 13 Feb 2013 11:21:27 GMT
Connection: close
Content-Length: 0
Set-Cookie: BIGipServerpool_http_chilis.com=359791882.20480.0000; path=/

200
http://www.chilis.com/(X(1)S(q24tqizldxqlvy55rjk5va2j))/Pages/ChilisVariationRoot.aspx?AspxAutoDetectCookieSupport=1

Can someone explain what this URL is and how to handle it? I know I can use the "Handling Redirects" section from diveintopython.net, but even with the code from that page I see the same 200 response.

EDIT: Using the code from DiveintoPython, I see it's a temporary redirect. What I don't understand is why the HTTP code returned by getcode() is still 200. Isn't that supposed to be the actual return code?
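For reference, the handler on that page boils down to roughly the following (a sketch from memory, not the exact code from the site): it lets urllib2 follow the redirect as usual, but records the intermediate status on the final response so it can be inspected afterwards.

import urllib2

class SmartRedirectHandler(urllib2.HTTPRedirectHandler):
    # follow the redirect as usual, but remember the intermediate status
    def http_error_301(self, req, fp, code, msg, headers):
        result = urllib2.HTTPRedirectHandler.http_error_301(self, req, fp, code, msg, headers)
        result.status = code
        return result

    def http_error_302(self, req, fp, code, msg, headers):
        result = urllib2.HTTPRedirectHandler.http_error_302(self, req, fp, code, msg, headers)
        result.status = code
        return result

opener = urllib2.build_opener(SmartRedirectHandler())
response = opener.open("http://www.chilis.com")
print getattr(response, "status", response.getcode())  # 302 here, 200 if no redirect happened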

EDIT2: Now that I see it better, it's not a weird redirection at all. I am editing the title.

EDIT3: If urllib2 follows the redirect automatically, I am not sure why the following code does not get the front page for chilis.com.

from bs4 import BeautifulSoup

docObj = get_url_data(url)
doc = docObj.read()
soup = BeautifulSoup(doc, 'lxml')
print(soup.prettify())

If I use the URL that the browser eventually ends up redirected to, it works (http://www.chilis.com/EN/Pages/home.aspx).

R11

1 Answer


urllib2 automatically follows redirects, so the information you're seeing is from the page you were redirected to.

If you don't want it to follow redirects, you'll need to subclass urllib2.HTTPRedirectHandler. Here's a relevant SO posting on how to do that: How do I prevent Python's urllib(2) from following a redirect
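A minimal sketch of that idea (based on the approach in that posting, so treat it as untested): instead of following the redirect, hand the redirect response itself back to the caller, so you see the 301/302 and its Location header directly.

import urllib
import urllib2

class NoRedirectHandler(urllib2.HTTPRedirectHandler):
    # return the redirect response itself instead of following it
    def http_error_302(self, req, fp, code, msg, headers):
        infourl = urllib.addinfourl(fp, headers, req.get_full_url())
        infourl.status = code
        infourl.code = code
        return infourl
    http_error_301 = http_error_303 = http_error_307 = http_error_302

opener = urllib2.build_opener(NoRedirectHandler())
response = opener.open("http://www.chilis.com")
print response.code                           # 301/302/... instead of 200
print response.info().getheader("Location")   # where the server wanted to send you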

Regarding EDIT 3: it looks like www.chilis.com requires accepting cookies. This can be implemented using urllib2, but I would suggest installing the requests module (http://pypi.python.org/pypi/requests/).
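If you do want to stay with urllib2, a cookie-aware opener can be built with cookielib; the following is a rough sketch (not tested against chilis.com):

import cookielib
import urllib2

# an opener with a cookie jar, so cookies set during the redirect chain
# are sent back on the follow-up requests, much like a browser does
cookie_jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))
response = opener.open("http://www.chilis.com")
print response.geturl()   # final URL after the redirects
html = response.read()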

With requests, the following seems to do exactly what you want (without any error handling):

import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.chilis.com')
soup = BeautifulSoup(r.text, 'lxml')
print(soup.prettify())
robertklep
  • Thanks for the reply. Understood why I see 200. Could you look at EDIT3 above for a follow-up question? – R11 Feb 13 '13 at 12:24
  • Thanks! Did not know about the requests module. – R11 Feb 13 '13 at 13:08
  • Do you know how the requests module does it internally? Does it send some fake cookies? – R11 Feb 13 '13 at 13:09
  • No, it handles cookies pretty much like a regular browser would :) Cookies aren't something special, just pieces of text that are sent to/from the browser from/to the server; `requests` implements that 'protocol'. – robertklep Feb 13 '13 at 13:14
  • You are right, I am an idiot. There need not be any cookies present for that URL, so pretty much nothing special is needed. Wonder why urllib2 does not do it right. Anyway, requests made it really easy for me. Thanks again. – R11 Feb 13 '13 at 13:19
  • I used `httpie` (also a very fine Python product) to debug requests to `chilis.com` and I'm getting caught in redirect loops when I *don't* accept cookies, hence my conclusion that accepting cookies was required. – robertklep Feb 13 '13 at 13:20