Downloading a protected file using python urllib

Question

I am trying to download a PDF file located at http://elwatan.com/pdf/telecharger.php?dir=JOURNAL&file=20120524.pdf . However, the site requires you to be logged in before you can download it. I was able to log in, but the server then redirects me to the home page, http://elwatan.com , and when I try to fetch the PDF's URL again, the download fails because it seems that I am not logged in. I think I need to use cookies, right?

If so, can you please explain how? I have never used them before.

Thanks :)

Like this maybe??? http://stackoverflow.com/questions/8734876/urllib2-with-cookies — Maria Zverina, May 25 '12 at 13:21
Or this http://stackoverflow.com/questions/7162850/pass-session-cookies-in-http-header-with-python-urllib2 — Maria Zverina, May 25 '12 at 13:23

score 2 · Answer 1 · answered May 25 '12 at 13:24

The mechanize library is very useful for situations like this. It simulates a browser, which includes filling in forms (such as login forms) and keeping state such as cookies. With it, you could log in to the site and then navigate to the PDF file. You would use something like the following code:

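import mechanize

br = mechanize.Browser()
br.open(login_url)  # login_url is a placeholder for the page with the login form
# code to log in with br (select the form, fill in the fields, submit)
data = br.open(pdf_url).get_data()  # pdf_url is a placeholder; data is the raw response body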

The data you get back should be the contents of the PDF file. You can save it to disk, or parse it with a PDF library if you need to extract something from it.
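As a rough sketch of that last step (not part of the original answer): the bytes returned by the server can be written to disk in binary mode. The helper name `save_pdf` and the `%PDF-` signature check are assumptions added for illustration. The check is there because a failed login typically returns an HTML page rather than the file.

```python
def save_pdf(data, path):
    """Write downloaded bytes to `path`, refusing anything that is not a PDF."""
    # PDF files normally begin with the signature "%PDF-"; an HTML login or
    # home page returned in place of the file will not.
    if not data.startswith(b"%PDF-"):
        raise ValueError("response does not look like a PDF (not logged in?)")
    # Binary mode: the bytes are written exactly as received.
    with open(path, "wb") as f:
        f.write(data)
    return path
```

If this raises `ValueError`, the session cookie was probably not sent with the request for the file.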

answered May 25 '12 at 13:24

murgatroid99

I haven't used mechanize for pdfs before, so I'm not exactly sure, but the data should be the pdf. You would probably have to use some other pdf library to actually get anything useful out of it. – murgatroid99 May 25 '12 at 13:35

score 1 · Answer 2 · answered May 25 '12 at 13:40

When using that web application, a "session" is generated for you. Session details are stored in your client within a cookie. Your client sends the cookie contents with each HTTP request. By doing so, the web application knows that your HTTP requests correspond to the same session. Initially, you are just an unknown user within that session. After logging in, the web application knows that requests within that session come from an authorized user.
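To make the cookie round trip concrete, here is a small sketch using only the standard library. The cookie name `PHPSESSID` and its value are hypothetical (the real site may use a different name), and the helper `cookie_header_from_set_cookie` is invented for illustration. It shows that the server sends a `Set-Cookie` header with attributes, while the client sends back only the `name=value` pair in a `Cookie` header.

```python
from http.cookies import SimpleCookie


def cookie_header_from_set_cookie(set_cookie_value):
    """Turn a server's Set-Cookie value into the Cookie header a client sends back."""
    parsed = SimpleCookie()
    parsed.load(set_cookie_value)  # parses name, value and attributes (Path, HttpOnly, ...)
    # Only name=value pairs are echoed back; the attributes stay on the client.
    return "; ".join(f"{name}={morsel.value}" for name, morsel in parsed.items())


# Hypothetical header as a login page might send it:
print(cookie_header_from_set_cookie("PHPSESSID=abc123; Path=/; HttpOnly"))
```

A browser, or a library that keeps cookies for you, does this bookkeeping automatically on every request.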

You have two options:

1. Log in via your browser, copy the session cookie it received, and send that cookie in subsequent requests made from Python.
2. Do everything in Python (the initial request, logging in, and retrieving the document).

Both can be a considerable amount of work (especially if you are new to these things), because you have to adjust your code to the specifics of the web application. A library like mechanize (as already mentioned by others) can save some work.
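Both approaches can be sketched with the standard library alone, using `urllib.request` and `http.cookiejar` from Python 3. This is an illustration under assumptions, not code tested against the site in the question: the function names are invented here, and the login URL and form field names depend on the site's login form, which you would have to inspect yourself.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar


def make_session():
    """Return an opener that stores cookies from responses and resends them."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def login_and_download(opener, login_url, form_fields, file_url):
    """Do everything in Python: POST the login form, then fetch the file."""
    body = urllib.parse.urlencode(form_fields).encode("ascii")
    # Passing data makes this a POST. A redirect after login is followed
    # automatically, and cookies set along the way are kept in the jar.
    with opener.open(login_url, data=body):
        pass
    # Same opener, so the session cookie is sent with this request.
    with opener.open(file_url) as response:
        return response.read()


def download_with_browser_cookie(file_url, cookie_header):
    """Reuse a session cookie copied from a browser that is already logged in."""
    request = urllib.request.Request(file_url, headers={"Cookie": cookie_header})
    with urllib.request.urlopen(request) as response:
        return response.read()
```

A call might look like `opener, jar = make_session()` followed by `login_and_download(opener, login_url, {"username": "me", "password": "secret"}, pdf_url)`, where the field names and values are hypothetical. The important design point is that the login and the download go through the same opener. A plain `urlopen` call for the PDF starts without any cookies, which matches the "not logged in" behaviour described in the question.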
