I created a function to read HTML content from specific url. Here is the code:
def __retrieve_html(self, address):
html = urllib.request.urlopen(address).read()
Helper.log('HTML length', len(html))
Helper.log('HTML content', html)
return str(html)
However the function is not always return the correct string. In some cases it returns a very long weird string.
For example if I use the URL: http://www.merdeka.com
, sometimes it will give the correct html string, but sometimes also returns a result like:
HTML content: b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xed\xfdyW\x1c\xb7\xd28\x8e\xffm\x9f\x93\xf7\xa0;y>\xc1\xbeA\xcc\xc2b\x03\x86\x1cl\xb0\x8d1\x86\x038yr\......Very long and much more characters.
It seems that it only happen in any pages that have a lot of content. For simple page like Facebook.com login page and Google.com index, it never happened. What is this? Where is my mistake and how to handle it?