
I'd like to download a series of PDF files from my intranet. I'm able to see the files in my web browser without issue, but when I try to automate pulling the files with Python, I run into problems. After working through the proxy setup at my office, I can download files from the internet quite easily with this answer:

import urllib2

url = 'http://www.sample.com/fileiwanttodownload.pdf'

user = 'username'
pswd = 'password'
proxy_ip = '12.345.56.78:80'
proxy_url = 'http://' + user + ':' + pswd + '@' + proxy_ip
proxy_support = urllib2.ProxyHandler({"http": proxy_url})
opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
urllib2.install_opener(opener)

file_name = url.split('/')[-1]
u = urllib2.urlopen(url)
f = open(file_name, 'wb')
f.write(u.read())
f.close()

but for whatever reason it won't work when the URL points to something on my intranet. The following error is returned:

Traceback (most recent call last):
  File "<ipython-input-13-a055d9eaf05e>", line 1, in <module>
    runfile('C:/softwaredev/python/pdfwrite.py', wdir='C:/softwaredev/python')
  File "C:\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 585, in runfile
    execfile(filename, namespace)
  File "C:/softwaredev/python/pdfwrite.py", line 26, in <module>
    u = urllib2.urlopen(url)
  File "C:\Anaconda\lib\urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "C:\Anaconda\lib\urllib2.py", line 410, in open
    response = meth(req, response)
  File "C:\Anaconda\lib\urllib2.py", line 523, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Anaconda\lib\urllib2.py", line 442, in error
    result = self._call_chain(*args)
  File "C:\Anaconda\lib\urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "C:\Anaconda\lib\urllib2.py", line 629, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "C:\Anaconda\lib\urllib2.py", line 410, in open
    response = meth(req, response)
  File "C:\Anaconda\lib\urllib2.py", line 523, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Anaconda\lib\urllib2.py", line 448, in error
    return self._call_chain(*args)
  File "C:\Anaconda\lib\urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "C:\Anaconda\lib\urllib2.py", line 531, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: Service Unavailable

Using the requests library in the following code, I can successfully pull down files from the internet, but when I try to pull a PDF from my office intranet, I just get a connection error sent back to me as HTML. The following code is run:

import requests

url = 'http://www.intranet.sample.com/?layout=attachment&cfapp=26&attachmentid=57142'

proxies = {
  "http": "http://12.345.67.89:80",
  "https": "http://12.345.67.89:80"
}

local_filename = 'test.pdf'
r = requests.get(url, proxies=proxies, stream=True)
with open(local_filename, 'wb') as f:
    for chunk in r.iter_content(chunk_size=1024):
        print chunk
        if chunk:
            f.write(chunk)
            f.flush()

And the HTML that comes back:

Network Error (tcp_error) 

A communication error occurred: "No route to host"
The Web Server may be down, too busy, or experiencing other problems preventing it from responding to requests. You may wish to try again at a later time.

For assistance, contact your network support team.

Is it possible that there is some network security setting that prevents automated requests outside of the web browser environment?
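
For what it's worth, one way to test that theory would be to send a browser-like request and see whether the response changes. This is only a sketch; the URL is the same placeholder as above and the header values are purely illustrative:

import requests

url = 'http://www.intranet.sample.com/?layout=attachment&cfapp=26&attachmentid=57142'

# Browser-like headers; the values here are illustrative, not required ones.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64)',
    'Accept': 'text/html,application/pdf,*/*',
}

r = requests.get(url, headers=headers, stream=True)
print r.status_code
print r.headers.get('content-type')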

thomastodon
2 Answers


Installing openers into urllib2 doesn't affect requests. You need to use requests' own support for proxies. It should be enough to pass them in the proxies argument to get, or you can set the HTTP_PROXY and HTTPS_PROXY environment variables. See http://docs.python-requests.org/en/latest/user/advanced/#proxies

import requests

proxies = {
  "http": "http://10.10.1.10:3128",
  "https": "http://10.10.1.10:1080",
}

requests.get("http://example.org", proxies=proxies)
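
If you'd rather not pass proxies on every call, the environment-variable route mentioned above works too, and it can carry the proxy credentials in the URL. This is only a sketch, reusing the placeholder address and credentials from the question:

import os
import requests

# requests picks these up automatically (trust_env is True by default).
os.environ['HTTP_PROXY'] = 'http://username:password@12.345.56.78:80'
os.environ['HTTPS_PROXY'] = 'http://username:password@12.345.56.78:80'

r = requests.get('http://www.sample.com/fileiwanttodownload.pdf', stream=True)
with open('fileiwanttodownload.pdf', 'wb') as f:
    for chunk in r.iter_content(chunk_size=1024):
        if chunk:
            f.write(chunk)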
asmeurer

Have you tried not using the proxy at all when the file is on the intranet?

You could try something like this in Python 2:

from urllib2 import urlopen

url = 'http://intranet/myfile.pdf'
local_filename = 'myfile.pdf'

with open(local_filename, 'wb') as f:
    f.write(urlopen(url).read())
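
If you want to stay with requests, the same idea looks roughly like this; the intranet URL is a placeholder, and the key point is stopping requests from routing the intranet host through the corporate proxy:

import requests

url = 'http://www.intranet.sample.com/?layout=attachment&cfapp=26&attachmentid=57142'

session = requests.Session()
session.trust_env = False  # ignore HTTP_PROXY/HTTPS_PROXY so the proxy is bypassed

r = session.get(url, stream=True)
with open('test.pdf', 'wb') as f:
    for chunk in r.iter_content(chunk_size=1024):
        if chunk:
            f.write(chunk)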
yensa