There's no real problem with your code (except for a bunch of indentation errors and syntax errors that apparently aren't in your real code), only with your attempts to debug it.
First, you did this:
print htmldata
That's perfectly fine, but since htmldata
is a urllib2
response object, printing it just prints that response object. Which apparently looks like this:
<addinfourl at 140176301032584 whose fp = <socket._fileobject object at 0x7f7d56a09750>>
That doesn't look like particularly useful information, but that's the kind of output you get when you print something that's only really useful for debugging purposes. It tells you what type of object it is, some unique identifier for it, and the key members (in this case, the socket fileobject wrapped up by the response).
Then you apparently tried this:
print htmldata.read()
But already called read
on the same object earlier:
parser.feed(htmldata.read())
When you read()
the same file-like object twice, the first time gets everything in the file, and the second time gets everything after everything in the file—that is, nothing.
What you want to do is read()
the contents once, into a string, and then you can reuse that string as many times as you want:
contents = htmldata.read()
parser.feed(contents)
print contents
It's also worth noting that, as the urllib2
documentation said right at the top:
See also The Requests
package is recommended for a higher-level HTTP client interface.
Using urllib2
can be a big pain, in a lot of ways, and this is just one of the more minor ones. Occasionally you can't use requests
because you have to dig deep under the covers of HTTP, or handle some protocol it doesn't understand, or you just can't install third-party libraries, so urllib2
(or urllib.request
, as it's renamed in Python 3.x) is still there. But when you don't have to use it, it's better not to. Even Python itself, in the ensurepip
bootstrapper, uses requests
instead of urllib2
.
With requests
, the normal way to access the contents of a response is with the content
(for binary) or text
(for Unicode text) properties. You don't have to worry about when to read()
; it does it automatically for you, and lets you access the text
over and over. So, you can just do this:
import requests
base_url = "http://status.aws.amazon.com/"
response = requests.get(base_url, timeout=30)
parser.feed(response.content) # assuming it wants bytes, not unicode
print response.text