Decoding html file downloaded with urllib

Question

I tried to download an HTML file like this:

import urllib

req  = urllib.urlopen("http://www.stream-urls.de/webradio")
html = req.read()

print html

html = html.decode('utf-16')

print html

Since the output of req.read() looks like Unicode, I tried to decode the response, but I get this error:

Traceback (most recent call last):
  File "e:\Documents\Python\main.py", line 8, in <module>
    html = html.decode('utf-16')
  File "E:\Software\Python2.7\lib\encodings\utf_16.py", line 16, in decode
    return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 38-39: illegal UTF-16 surrogate

What do I have to do to get the right encoding?

Could you at least tell us what the bytes in position 38-39 are? – barak manos, Dec 20 '16 at 12:12
BTW, the problem at hand has nothing to do with `urllib` or with HTML. It concerns only character encoding, so you might want to rephrase and minimize your question to focus on that problem alone. – barak manos, Dec 20 '16 at 12:13
That page returns gzipped data (i.e. not plain text) with `charset=utf-8`. Why are you decoding with utf-16? – Alex K., Dec 20 '16 at 12:14
I partially take my second comment back. The specific URL is important when investigating this issue. – barak manos, Dec 20 '16 at 12:29

furas · Accepted Answer · 2016-12-20T13:25:45.233

3

Use `requests`: it decompresses the gzipped response and decodes it for you, so `r.text` contains the correct HTML.

import requests

r  = requests.get("http://www.stream-urls.de/webradio")
print r.text
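
To decode the bytes into text, the charset has to come from somewhere, and for an HTTP response the usual place is the Content-Type header (this page reports `charset=utf-8`). As a rough illustration only, and not a description of how `requests` is implemented, here is a standard-library sketch of reading that charset. The helper name `charset_from_content_type`, the sample header values and the `utf-8` fallback are assumptions made for the example.

```python
# -*- coding: utf-8 -*-
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset named in a Content-Type header value.

    Falls back to `default` (an assumption of this sketch) when the
    header does not carry a charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default


# Hypothetical header values, for illustration only
with_charset = charset_from_content_type("text/html; charset=UTF-8")
without_charset = charset_from_content_type("text/html")

# Decode some sample bytes with the charset found in the header
raw = u"Radio K\u00f6ln".encode("utf-8")
text = raw.decode(with_charset)
```

The point is that the codec should be taken from what the server declares rather than guessed from how the raw bytes look when printed.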

EDIT: how to use `gzip` and `StringIO` to decompress the data in memory, without saving it to a file:

import urllib
import gzip
import StringIO

req  = urllib.urlopen("http://www.stream-urls.de/webradio")

# create file-like object in memory
buf = StringIO.StringIO(req.read())

# create gzip object using file-like object instead of real file on disk
f = gzip.GzipFile(fileobj=buf)

# get data from file
html = f.read()

print html
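
A slightly more defensive variant, as a minimal sketch: gzip streams start with the two bytes 0x1f 0x8b, so the body can be checked before decompressing and then decoded as UTF-8, the charset reported for this page in the comments. The function name `decode_body` and the sample payload are hypothetical. The demonstration builds its own compressed data in memory, so it runs without network access, and it uses `io.BytesIO`, which is available in both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_body(raw, charset="utf-8"):
    """Decompress `raw` if it looks gzipped, then decode it to text."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
    return raw.decode(charset)


# Offline demonstration: gzip a small hypothetical page in memory
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    gz.write(u"<html>M\u00fcnchen</html>".encode("utf-8"))
compressed = buf.getvalue()

text = decode_body(compressed)         # compressed input
plain = decode_body(b"<p>plain</p>")   # uncompressed input passes through
```

With a real response, the bytes returned by `req.read()` would be passed to `decode_body` in place of `compressed`.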

edited Dec 20 '16 at 13:25

answered Dec 20 '16 at 12:27

furas


`requests` is not a built-in package (at least not in Python 2.x). Can you please indicate how to `pip` it? – barak manos Dec 20 '16 at 12:30
BTW: [Does python urllib2 automatically uncompress gzip data fetched from webpage?](http://stackoverflow.com/questions/3947120/does-python-urllib2-automatically-uncompress-gzip-data-fetched-from-webpage) - it shows how to use `gzip` module to ungzip data from server. – furas Dec 20 '16 at 12:33
