
I created a function to read HTML content from a specific URL. Here is the code:

import urllib.request

def __retrieve_html(self, address):
    html = urllib.request.urlopen(address).read()
    Helper.log('HTML length', len(html))
    Helper.log('HTML content', html)
    return str(html)

However, the function does not always return the correct string. In some cases it returns a very long, weird string.

For example, if I use the URL http://www.merdeka.com, sometimes it gives the correct HTML string, but sometimes it returns a result like:

HTML content: b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xed\xfdyW\x1c\xb7\xd28\x8e\xffm\x9f\x93\xf7\xa0;y>\xc1\xbeA\xcc\xc2b\x03\x86\x1cl\xb0\x8d1\x86\x038yr\......Very long and much more characters.

It seems that this only happens on pages that have a lot of content. For simple pages like the Facebook.com login page and the Google.com index, it never happens. What is this? Where is my mistake, and how do I handle it?

yunhasnawa

2 Answers


It appears the response from http://www.merdeka.com is gzip-compressed.
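You can actually see this in your own log output: gzip streams always begin with the magic bytes 0x1f 0x8b, which is exactly how your weird string starts. A quick sketch of checking for that prefix (the helper name `looks_gzipped` is my own, not from any library):

```python
import gzip

def looks_gzipped(raw_bytes):
    # Gzip data always starts with the two magic bytes 0x1f 0x8b,
    # matching the b'\x1f\x8b...' prefix from the question's log.
    return raw_bytes[:2] == b'\x1f\x8b'

# Demonstrate with a locally compressed payload instead of a live request
payload = gzip.compress(b'<html>hello</html>')
```

`looks_gzipped(payload)` returns True, and `gzip.decompress(payload)` restores the original bytes.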

Give this a try:

import gzip
import urllib.request
def __retrieve_html(self, address):
    with urllib.request.urlopen(address) as resp:
        html = resp.read()
        Helper.log('HTML length', len(html))
        Helper.log('HTML content', html)
        if resp.info().get('Content-Encoding') == 'gzip':
            html = gzip.decompress(html)
        return html

How to decode your html object I leave as an exercise for you.
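For reference, one way that decoding step could look: pull the charset out of the Content-Type header and fall back to UTF-8 when the server omits it. This is a minimal sketch; `decode_html` and its signature are my own invention, not part of urllib:

```python
def decode_html(raw_bytes, content_type):
    # Default to UTF-8, a reasonable guess for modern pages
    charset = 'utf-8'
    # A Content-Type header looks like "text/html; charset=UTF-8";
    # scan its parameters for an explicit charset
    for part in content_type.split(';'):
        part = part.strip()
        if part.lower().startswith('charset='):
            charset = part.split('=', 1)[1].strip()
    # errors='replace' avoids crashing on stray undecodable bytes
    return raw_bytes.decode(charset, errors='replace')
```

With urlopen you would pass `resp.headers.get('Content-Type', '')` as the second argument.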

Alternatively, you could just use the Requests module: http://docs.python-requests.org/en/latest/

Install it with:

pip install requests

Then execute like:

import requests
r = requests.get('http://www.merdeka.com')
r.text

Requests didn't appear to have any trouble with the response from http://www.merdeka.com.

Joe Young

You've got bytes instead of a string, because urllib can't decode the response for you. This could be because some sites omit the encoding declaration in their Content-Type header.

For example, google.com has:

Content-Type: text/html; charset=UTF-8

and that http://www.merdeka.com website has just:

Content-Type: text/html

So you need to manually decode the response, for example with UTF-8 encoding:

html = urllib.request.urlopen(address).read().decode('utf-8')

The problem is that you need to use the correct encoding, and if it is not in the server's headers, you need to guess it somehow.
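One simple guessing strategy is to try a few common encodings in order and keep the first that decodes cleanly. A hedged sketch; the candidate list and the name `guess_decode` are my own choices, not a standard API:

```python
def guess_decode(raw_bytes, candidates=('utf-8', 'cp1252', 'iso-8859-1')):
    # Try strict decodes first; UTF-8 rejects most non-UTF-8 input,
    # and cp1252 rejects a handful of undefined byte values, so the
    # order matters (iso-8859-1 accepts any byte and acts as a catch-all).
    for enc in candidates:
        try:
            return raw_bytes.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Last resort: never fail, but mark undecodable bytes
    return raw_bytes.decode('utf-8', errors='replace'), 'utf-8'
```

This is only a heuristic; a mis-guess silently produces mojibake, which is why reading the header (or using a detection library) is preferable when possible.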

See this question for more information: How to handle response encoding from urllib.request.urlopen()

PS: Consider moving from the somewhat dated urllib to the requests lib. It's simpler, trendier and sexier at this time :)

anti1869