I am trying to download about 20 PDFs from a site that requires a login. This is what I have so far, but none of the downloaded files are valid PDFs (they are all corrupted). I am also new to Python.

import mechanize
import urllib2

def download_file(download_url):
    response = urllib2.urlopen(download_url)
    print response.geturl() 
    print response.read()
    file = open("document.pdf", 'wb')
    file.write(response.read())
    file.close()

brwser = mechanize.Browser()
brwser.addheaders = [('User-agent', 'Firefox')]
response = brwser.open(url)

brwser.select_form(nr = 0)
brwser.form['UserName'] = 'username'
brwser.form['Password'] = 'password'
nextpage = brwser.submit()

# Navigate to the page I want

for link in brwser.links():
    if link.text == 'Some pdf':
        request = brwser.follow_link(link)
        download_file(link.url)

I am not sure what to try next. The URLs for the PDFs look like this:

https://example.com/something/source2.aspx?id=e9a9bfdc-7d97-e411-9e03-76439cf4d30e

Also, the output of response.read() is as follows:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>
Source
</title>
<script type='text/javascript'>
   window.onload = function () {
       var url = window.location.href.replace('source.aspx?', 'source2.aspx?');
       window.location = url;
   };
</script>
</head>
<body>
<div style='position:fixed; height:100%; width:100%; overflow:hidden; top:100px; left:100px;'>Loading, please wait.</div>
</body>
</html>

So how do I download these files?

1 Answer

You might consider using Selenium, which is perhaps better suited to interacting with this site (not that mechanize isn't an excellent tool). There is decent documentation on how to accomplish this (e.g. here or here). The generally accepted approach is to configure Firefox so that it saves the files rather than attempting to open them, and then visit every link.
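As a rough sketch of the Firefox configuration described above, the preferences below ask Firefox to save PDFs to a folder without prompting. The function names `firefox_download_prefs` and `make_driver` and the `download_dir` argument are made up for illustration. It also assumes the site serves the files with the `application/pdf` content type, which may not hold here.

```python
import os


def firefox_download_prefs(download_dir):
    """Return Firefox preferences that save PDFs to disk without prompting."""
    return {
        # 2 = use the custom directory given in browser.download.dir
        "browser.download.folderList": 2,
        "browser.download.dir": os.path.abspath(download_dir),
        # Skip the "open or save?" dialog for this MIME type (assumed type)
        "browser.helperApps.neverAsk.saveToDisk": "application/pdf",
        # Disable the built-in PDF viewer so PDFs are downloaded, not rendered
        "pdfjs.disabled": True,
    }


def make_driver(download_dir):
    """Start Firefox with the preferences above (needs selenium and geckodriver)."""
    # Imported here so the helper above can be used without selenium installed
    from selenium import webdriver

    options = webdriver.FirefoxOptions()
    for name, value in firefox_download_prefs(download_dir).items():
        options.set_preference(name, value)
    return webdriver.Firefox(options=options)
```

With a driver created this way, you would log in through the same form fields as in the question and then visit each PDF link. The files should land in `download_dir` if the assumptions above match the site.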

You may also find that when you resolve the links, you end up somewhere completely different, depending on where the PDFs are hosted and how they were generated. You could also roll in an approach like this one for link extraction.
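If you would rather stay with mechanize, two details in the question's code are worth a look. `urllib2.urlopen` does not share the cookies held by the logged-in `mechanize.Browser`. Also, `response.read()` is called twice, so the second call has nothing left to return. The sketch below fetches through the browser object, reads the body once, applies the same `source.aspx?` to `source2.aspx?` rewrite that the "Loading, please wait." page performs in JavaScript, and checks for the `%PDF-` signature before saving. The helper names are hypothetical, and whether the rewritten URL really returns the PDF on this site is an assumption.

```python
def resolve_js_redirect(url):
    """Apply the same rewrite the interstitial page performs in JavaScript."""
    return url.replace("source.aspx?", "source2.aspx?")


def looks_like_pdf(data):
    """PDF files begin with the bytes '%PDF-'; an HTML page does not."""
    return data.startswith(b"%PDF-")


def save_pdf(browser, url, path):
    """Fetch url through the logged-in browser and write the body to path.

    `browser` is assumed to be the logged-in mechanize.Browser from the
    question; any object whose open(url) returns a response with read() works.
    """
    data = browser.open(resolve_js_redirect(url)).read()  # read the body once
    if not looks_like_pdf(data):
        raise ValueError("response from %s is not a PDF" % url)
    with open(path, "wb") as handle:
        handle.write(data)
    return len(data)
```

In the question's loop, this would replace the `download_file(link.url)` call with something like `save_pdf(brwser, link.absolute_url, 'file1.pdf')`, using a different file name per link so that earlier downloads are not overwritten.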
