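import random
import socket
import urllib2

# User-Agent strings to rotate between requests.
AGENTS = [
    'Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1041.0 Safari/535.21',
    'Mozilla/5.0 (Windows NT 6.1; rv:12.0) Gecko/20120403211507 Firefox/12.0',
]

def download(source_url):
    try:
        # Global socket timeout (seconds), so a stalled server cannot hang the call.
        socket.setdefaulttimeout(20)
        req = urllib2.Request(source_url)
        req.add_header('User-Agent', random.choice(AGENTS))
        resp = urllib2.urlopen(req)
        return resp.read()
    except Exception as e:
        # Any failure is printed and reported to the caller as an empty string.
        print(e)
        return ""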

I wrote this download function. How do I make it handle 301/302 redirects?

For example, my function doesn't work with this URL: http://tumblr.com/tagged/long-reads

TIMEX

2 Answers


First, get the HTTP response code.

If the code is 30x, read the new URL from the response's Location header.

Then you can call your download() function recursively with the new URL.

You should also add a parameter that counts redirections, to avoid infinite looping.
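
The steps in this answer could be sketched roughly as follows. This is only an illustration, written for Python 3's urllib.request (the successor to urllib2), not the asker's exact setup. The names NoRedirect, fetch_once, max_redirects and the fetch parameter are hypothetical helpers introduced here. Automatic redirect following is switched off first, since otherwise the 30x code is never visible to the caller.

```python
import urllib.error
import urllib.parse
import urllib.request

REDIRECT_CODES = (301, 302, 303, 307, 308)


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Hypothetical handler: stop urllib from following redirects itself."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None makes urllib raise HTTPError for the 30x response.
        return None


def fetch_once(url, timeout=20):
    """Make one request and return (status, location, body)."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        resp = opener.open(url, timeout=timeout)
        return resp.getcode(), None, resp.read()
    except urllib.error.HTTPError as err:
        if err.code in REDIRECT_CODES:
            return err.code, err.headers.get('Location'), b''
        raise  # a real error such as 404 or 500, not a redirect


def download(url, max_redirects=5, fetch=fetch_once):
    """Follow redirects by hand, giving up when the counter runs out."""
    status, location, body = fetch(url)
    if status in REDIRECT_CODES and location:
        if max_redirects <= 0:
            raise RuntimeError('too many redirects')
        # Location may be relative, so resolve it against the current URL.
        next_url = urllib.parse.urljoin(url, location)
        return download(next_url, max_redirects - 1, fetch)
    return body
```

Passing the fetch function in as a parameter is a design choice made for this sketch: it lets the redirect logic be exercised with a fake fetcher, without touching the network.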

JerabekJakub

If a redirect code (301 or 302) is returned, urllib2 should follow that redirect automatically.

See the related question on this. If urllib2 does not follow the redirect in your case, this article examines redirect handling in detail.
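
One way to check the automatic behaviour without depending on an external site is a small local test server. The sketch below is an assumption-laden illustration in Python 3, where urllib.request plays the role of urllib2. The Handler class, the /old and /new paths, and the demo function are all made up for the example.

```python
import http.server
import threading
import urllib.request


class Handler(http.server.BaseHTTPRequestHandler):
    """Hypothetical test server: /old redirects to /new with a 301."""

    def do_GET(self):
        if self.path == '/old':
            self.send_response(301)
            self.send_header('Location', '/new')
            self.send_header('Content-Length', '0')
            self.end_headers()
        else:
            body = b'final page'
            self.send_response(200)
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def demo():
    # Port 0 asks the OS for any free port.
    server = http.server.HTTPServer(('127.0.0.1', 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        base = 'http://127.0.0.1:%d' % server.server_port
        # An empty ProxyHandler bypasses any system proxy; the default
        # redirect handler is still installed by build_opener.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        resp = opener.open(base + '/old', timeout=5)
        # geturl() reports the URL that was finally fetched.
        return resp.geturl()[len(base):], resp.read()
    finally:
        server.shutdown()
        server.server_close()
```

If the path returned by demo() differs from the one requested, the redirect was followed. Comparing geturl() with the requested URL is the same check you could apply to the real site.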

Joseph Victor Zammit