I am trying to read the HTML from a URL. I tried the following:

import requests
f = requests.get('http://www.google.com')
print(f.text)

This raised the following exception:

requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.google.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x03142310>: Failed to establish a new connection: [Errno 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond',))

So I assume that my workplace (a university) uses a proxy. I used http://www.whatismyproxy.com/ to get the external IP, guessed that the port is 80, and wrote the following code (the IP has been changed):

import requests
link = 'http://www.google.com'  # the URL being fetched
f = requests.get(link, 
                 proxies={"http": "http://123.45.678.910:80"})
print(f.text)

This returns a response, but the HTML is not Google's: it is an Apache directory listing, and it is identical if I change the URL to Twitter:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
 <head>
  <title>Index of /</title>
 </head>
 <body>
<h1>Index of /</h1>
  <table>
   <tr><th valign="top"><img src="/icons/blank.gif" alt="[ICO]"></th><th><a href="?C=N;O=D">Name</a></th><th><a href="?C=M;O=A">Last modified</a></th><th><a href="?C=S;O=A">Size</a></th><th><a href="?C=D;O=A">Description</a></th></tr>
   <tr><th colspan="5"><hr></th></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td><td><a href="direct.dat">direct.dat</a></td><td align="right">2013-10-24 18:09  </td><td align="right"> 73 </td><td>&nbsp;</td></tr>
<tr><td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td><td><a href="errors/">errors/</a></td><td align="right">2015-01-13 16:15  </td><td align="right">  - </td><td>&nbsp;</td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td><td><a href="filtered.dat">filtered.dat</a></td><td align="right">2015-02-06 13:39  </td><td align="right">3.0K</td><td>&nbsp;</td></tr>
<tr><td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td><td><a href="html/">html/</a></td><td align="right">2016-09-30 07:50  </td><td align="right">  - </td><td>&nbsp;</td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td><td><a href="wpad.dat">wpad.dat</a></td><td align="right">2016-03-30 05:16  </td><td align="right">2.5K</td><td>&nbsp;</td></tr>
   <tr><th colspan="5"><hr></th></tr>
</table>
<address>Apache/2.4.10 (Debian) Server at www.google.com Port 80</address>
</body></html>

Is this something I can fix, or is it related to my work's settings (and how do I confirm this)?

Phil
  • Making progress: I went to chrome://net-internals/#proxy, which gave me the URL of a PAC script. That URL pointed to a wpad.dat file, which gave me some different proxy details. These work with google.com, but not with the actual site I need! – Phil Aug 22 '17 at 11:12

1 Answer

The proxy settings I needed were not visible from an external website. I obtained them from a wpad.dat file, which I found at wpad.myuniversityname.ac. A second useful note: requests picks the proxy by the scheme of the requested URL, so you may need to extend the proxies dictionary to include both http and https entries:

proxies={"http": "http://123.45.678.910:80", "https": "http://123.45.678.910:80"}
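
As a rough sketch of the steps above, the proxy candidates can be pulled out of the wpad.dat text and turned into a proxies dictionary that covers both schemes. The PAC content, the host proxy.example.ac.uk, the port 8080, and the helper names pac_proxy_candidates and build_proxies are all made up for illustration; a real wpad.dat will look different. A PAC file is JavaScript and may choose a proxy per URL, so the regular expression only lists the candidates it mentions and does not evaluate the script.

```python
import re

# Hypothetical PAC (wpad.dat) content for illustration; a real file will differ.
SAMPLE_PAC = '''
function FindProxyForURL(url, host) {
    if (isPlainHostName(host)) return "DIRECT";
    return "PROXY proxy.example.ac.uk:8080; DIRECT";
}
'''


def pac_proxy_candidates(pac_text):
    """Return the unique 'host:port' strings named in PROXY directives, in order."""
    found = re.findall(r'PROXY\s+([A-Za-z0-9.-]+:\d+)', pac_text)
    unique = []
    for entry in found:
        if entry not in unique:
            unique.append(entry)
    return unique


def build_proxies(host_port):
    """Build a requests-style proxies dict with entries for both URL schemes."""
    proxy_url = "http://" + host_port
    return {"http": proxy_url, "https": proxy_url}


candidates = pac_proxy_candidates(SAMPLE_PAC)
proxies = build_proxies(candidates[0])
print(proxies)
```

The resulting dictionary can then be passed as requests.get(link, proxies=proxies, timeout=10). If the PAC file lists several proxies, each candidate may need to be tried in turn, since the script may route different sites through different proxies.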
Phil