
I am trying to make a small program that downloads subtitles for movie files.

However, I noticed that following a link in Chrome and opening the same link with urllib2.urlopen() do not give the same result.

As an example, consider the link http://www.opensubtitles.org/en/subtitleserve/sub/5523343 . In Chrome this redirects to http://osdownloader.org/en/osdownloader.subtitles-for.you/subtitles/5523343 which, after a little while, downloads the file I want.

However, when I use the following code in Python, I get redirected to a different page:

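import urllib2

url = "http://www.opensubtitles.org/en/subtitleserve/sub/5523343"
# urlopen() follows HTTP redirects automatically
response = urllib2.urlopen(url)

# geturl() returns the URL that was finally retrieved
if response.geturl() == url:
    print "No redirect"
else:
    print url, " --> ", response.geturl()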

Result: http://www.opensubtitles.org/en/subtitleserve/sub/5523343 --> http://www.opensubtitles.org/en/subtitles/5523343/the-musketeers-commodities-en

Why does that happen? How can I follow the same redirect as with the browser?

(I know that these sites offer APIs for Python, but this is meant as Python practice and as my first time playing with urllib2.)

Cantfindname

1 Answer


There's a significant difference between the request Chrome makes and the one your urllib2 script makes: the HTTP User-Agent header (https://en.wikipedia.org/wiki/User_agent).

opensubtitles.org probably detects that you're trying to retrieve the webpage programmatically and blocks the request. Try using one of the User-Agent strings from Chrome (more here: http://www.useragentstring.com/pages/Chrome/):

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36

in your script.

See the question "Changing user agent on urllib2.urlopen" for how to edit your script to send a custom User-Agent header.
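As a rough illustration of the idea, here is a minimal sketch of sending the Chrome User-Agent string quoted above. It assumes Python 3, where urllib2's functionality lives in urllib.request; the helper names build_request and final_url are made up for this example.

```python
import urllib.request

# User-Agent string quoted in the answer above.
CHROME_UA = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) "
             "AppleWebKit/537.36 (KHTML, like Gecko) "
             "Chrome/41.0.2227.1 Safari/537.36")


def build_request(url, user_agent=CHROME_UA):
    """Return a Request that carries a browser-like User-Agent header.

    Hypothetical helper: nothing is sent over the network here.
    """
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def final_url(url):
    """Open the URL (following HTTP redirects) and return where it ended up."""
    with urllib.request.urlopen(build_request(url)) as response:
        return response.geturl()

# Example call (performs a real network request):
# print(final_url("http://www.opensubtitles.org/en/subtitleserve/sub/5523343"))
```

Note that this only changes what the server sees in the request headers; it cannot make urllib follow a redirect that the browser performs in JavaScript.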

I would also recommend using the requests library for Python instead of urllib2, as its API is much easier to understand: http://docs.python-requests.org/en/latest/.

Niklas9
  • Changing the user agent does not fix the problem, either with urllib2 or with requests – Cantfindname Jan 17 '16 at 22:12
  • Ah, I've looked into this a bit further now @Cantfindname. It seems they do the redirect to the file to be downloaded in JavaScript, which neither urllib2 nor requests executes. To do this programmatically (whether you use urllib2, requests, or a language other than Python), you have to parse the HTML/JavaScript, figure out what the link is, and then make a new request to the file URL. – Niklas9 Jan 17 '16 at 22:16
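The parsing step described in the last comment might look roughly like the sketch below. It is only an illustration: the actual markup of the opensubtitles page is not shown in this thread, so the two patterns (a `location = "..."` assignment and a meta refresh tag) and the function name find_client_side_redirect are assumptions, and a real page may need a different pattern or a proper HTML parser.

```python
import re
from urllib.parse import urljoin

# Hypothetical patterns: the real page may build its redirect differently.
JS_REDIRECT = re.compile(
    r"""(?:window\.|document\.)?location(?:\.href)?\s*=\s*["']([^"']+)["']""")
META_REFRESH = re.compile(
    r"""<meta[^>]+http-equiv=["']?refresh["']?[^>]*"""
    r"""content=["'][^"';]*;\s*url=([^"'>]+)""",
    re.IGNORECASE)


def find_client_side_redirect(html, base_url):
    """Return the absolute URL of a client-side redirect found in html.

    Returns None when neither assumed pattern matches.
    """
    for pattern in (JS_REDIRECT, META_REFRESH):
        match = pattern.search(html)
        if match:
            # Resolve relative targets against the page's own URL.
            return urljoin(base_url, match.group(1).strip())
    return None
```

The returned URL would then be fetched with a second request, for example by passing it to urlopen() just like the original link.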