Printing same HTTPResponse Object returns different outputs - Python

Question

def crawl(url):
    html = getHTML(url) # getHTML() returns HTTPResponse
    print(html.read()) # PRINT STATEMENT 1
    if (html == None):
        print("Error getting HTML")
    else:
        # parse html
        bsObj = BeautifulSoup(html, "lxml")
        # print data
        try:
            print(bsObj.h1.get_text())
        except AttributeError as e:
            print(e)

        print(html.read()) # PRINT STATEMENT 2

What I don't understand is this: PRINT STATEMENT 1 prints the whole HTML, whereas PRINT STATEMENT 2 prints only b''.

What is happening here? I'm quite new to Python.

As an aside, you shouldn't do `html == None`. See http://stackoverflow.com/questions/14247373/python-none-comparison-should-i-use-is-or. — edwinksl, Jun 19 '16 at 09:36

Alastair McCormack · Accepted Answer · 2016-06-19T10:01:42.617

html is an HTTPResponse object. HTTPResponse supports file-like operations, such as read().

Just like when reading a file, a read() consumes the available data and moves the file pointer to the end of the file/data. A subsequent read() has nothing to return.
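To see the effect in isolation, here is a minimal sketch that uses io.BytesIO as a stand-in for the response. The stand-in and its sample bytes are assumptions for illustration; the asker's HTTPResponse behaves the same way with respect to read().

```python
import io

# io.BytesIO is only a stand-in for the response object (an assumption
# for illustration); it is file-like and supports read() in the same way.
stream = io.BytesIO(b"<html><h1>Title</h1></html>")

first = stream.read()   # consumes everything up to the end of the data
second = stream.read()  # nothing is left to read, so this is empty

print(first)   # b'<html><h1>Title</h1></html>'
print(second)  # b''
```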

You have two options:

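Reset the file pointer to the beginning after reading using the seek() method, if the object supports it. Regular files do, but note that the HTTPResponse returned by urllib is not seekable (seek() raises io.UnsupportedOperation), so this option only applies to seekable file-like objects:

print(html.read())
html.seek(0) # moves the file pointer to byte 0 relative to the start of the file/data; only valid if html is seekable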

Save the result instead:

html_body = html.read()
print(html_body)

Typically, you would use the second option, since html_body can then be reused as often as needed without touching the response again.
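Applied to a function shaped like the asker's crawl(), the second option might look roughly like the sketch below. The name crawl_body is hypothetical, and io.BytesIO again stands in for the real response so that the example runs without a network connection or BeautifulSoup.

```python
import io

def crawl_body(response):
    """Read the response once and reuse the bytes (hypothetical helper)."""
    if response is None:         # check before calling read()
        print("Error getting HTML")
        return None
    html_body = response.read()  # the only read() call
    print(html_body)             # first use of the body
    # html_body could also be handed to a parser at this point,
    # e.g. BeautifulSoup(html_body, "lxml"), which accepts bytes.
    print(html_body)             # second use, still the full body
    return html_body

# Stand-in for the HTTPResponse (assumption for illustration).
body = crawl_body(io.BytesIO(b"<html><h1>Title</h1></html>"))
```

Both print calls show the full body, because the bytes live in html_body and no longer depend on the response's read position.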
