
I'm trying to write a program that will (among other things) get text or source code from a predetermined website. I'm learning Python to do this, and most sources have told me to use urllib2. Just as a test, I tried this code:

import urllib2
response = urllib2.urlopen('http://www.python.org')
html = response.read()

Instead of acting in any expected way, the shell just sits there, as if waiting for some input. There isn't even a ">>>" or "...". The only way to exit this state is with Ctrl+C. When I do this, I get a whole bunch of error messages, like:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/m/mls/pkg/ix86-Linux-RHEL5/lib/python2.5/urllib2.py", line 124, in urlopen
    return _opener.open(url, data)
  File "/m/mls/pkg/ix86-Linux-RHEL5/lib/python2.5/urllib2.py", line 381, in open
    response = self._open(req, data)

I'd appreciate any feedback. Is there a different tool than urllib2 I should use, or can you give advice on how to fix this? I'm using a networked computer at work, and I'm not entirely sure how the shell is configured or how that might affect anything.

Brad Elliott
  • You're getting a stack trace, meaning an exception was thrown. Posting the entire stack trace will make the diagnosis easier. – mipadi Jan 06 '12 at 17:09
  • In my case it was a firewall issue. My local firewall LuLu was blocking all python requests. Deleting that rule solved that issue. – asmaier Oct 02 '19 at 12:17

4 Answers


With 99.999% probability, it's a proxy issue. Python is incredibly bad at detecting the right http proxy to use, and when it cannot find the right one, it just hangs and eventually times out.

So first you have to find out which proxy should be used. Check your browser's options (Tools -> Internet Options -> Connections -> LAN Setup... in IE, etc.). If it's using a script to autoconfigure, you'll have to fetch the script (which should be some sort of JavaScript) and find out where your request is supposed to go. If no script is specified and the "automatically determine" option is ticked, you might as well just ask some IT guy at your company.

I assume you're using Python 2.x. From the Python docs on urllib:

import urllib

# Use http://www.someproxy.com:3128 for http proxying
proxies = {'http': 'http://www.someproxy.com:3128'}
filehandle = urllib.urlopen(some_url, proxies=proxies)

Note that ProxyHandler figuring out default values is what already happens when you use urlopen, so relying on the defaults is probably not going to work.

If you really want urllib2, you'll have to specify a ProxyHandler, like the example on that page. Authentication might or might not be required (usually it's not).
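For reference, an explicit ProxyHandler setup looks roughly like this. The proxy address below is a made-up placeholder (substitute your company's real proxy), and the actual fetch is left commented out since it only succeeds behind a working proxy:

```python
try:
    import urllib2                    # Python 2
except ImportError:
    import urllib.request as urllib2  # the same API lives here in Python 3

# Build an opener that always goes through the given proxy instead of
# letting urllib2 guess one from the environment.
proxy = urllib2.ProxyHandler({'http': 'http://proxy.example.com:3128'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)  # makes plain urlopen() use this opener too

# response = opener.open('http://www.python.org')  # needs a real proxy
# html = response.read()
```

Once install_opener has been called, any later urllib2.urlopen call in the same process goes through the proxy as well.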

Giacomo Lacava
  • Thank you. It turns out that this was indeed a proxy issue. I resolved it using:

    proxypassmgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
    proxypassmgr.add_password(None, 'http://proxyaddress:portnumber', username, password)
    authinfo = urllib2.ProxyBasicAuthHandler(proxypassmgr)
    proxy_support = urllib2.ProxyHandler({"http" : "http://cache1.lexmark.com:80"})
    opener = urllib2.build_opener(proxy_support, authinfo)
    urllib2.install_opener(opener)
    req = urllib2.Request(theurl)

    – Brad Elliott Feb 27 '12 at 20:59

This isn't a good answer to "How to do this with urllib2", but let me suggest python-requests. The whole reason it exists is because the author found urllib2 to be an unwieldy mess. And he's probably right.
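For what it's worth, the same fetch with requests is about this short. requests is a third-party package (pip install requests); the proxy address below is a placeholder, and the request itself is commented out since it needs network access:

```python
import requests  # third-party: pip install requests

# requests also honours the HTTP_PROXY / HTTPS_PROXY environment
# variables, so an explicit proxies dict is often unnecessary.
proxies = {'http': 'http://proxy.example.com:3128'}  # placeholder address

# response = requests.get('http://www.python.org',
#                         proxies=proxies, timeout=5)
# html = response.text
```

The timeout parameter means a bad proxy or unreachable host fails quickly instead of hanging the shell.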

Tom

That is very weird; have you tried a different URL?
Otherwise there is httplib, though it is more complicated. Here's your example using httplib:

import httplib as h
domain = h.HTTPConnection('www.python.org')
domain.connect()
domain.request('GET', '/fish.html')
response = domain.getresponse()
if response.status == h.OK:
    html = response.read()
Grace B
  • This is doing the same no-response thing about the third line. Here are the errors it gives: Traceback (most recent call last): File "<stdin>", line 1, in ? File "/usr/lib/python2.4/httplib.py", line 626, in connect self.sock.connect(sa) File "<string>", line 1, in connect – Brad Elliott Jan 06 '12 at 18:39
  • Python 2.4? How old is your setup? – Has QUIT--Anony-Mousse Jan 06 '12 at 22:45
  • like I said, have you tried with another site? Because just going to `http://python.org/fish.html` in Chrome results in a 404, which would be the cause of the error – Grace B Jan 07 '12 at 09:27

I get a 404 error almost immediately (no hanging):

>>> import urllib2
>>> response = urllib2.urlopen('http://www.python.org/fish.html')
Traceback (most recent call last):
  ...
urllib2.HTTPError: HTTP Error 404: Not Found

If I try and contact an address that doesn't have an HTTP server running, it hangs for quite a while until the timeout happens. You can shorten it by passing the timeout parameter to urlopen:

>>> response = urllib2.urlopen('http://cs.princeton.edu/fish.html', timeout=5)
Traceback (most recent call last):
  ...
urllib2.URLError: <urlopen error timed out>
jterrace
  • Yeah, delete the "fish" part. That page doesn't exist, and I don't know where I got that. I'm trying it with just www.python.org now, but it's still not working. – Brad Elliott Jan 06 '12 at 18:40