urllib2 reads web page with wrong unicode

Question

I am trying to download a web page using urllib2 in Python.

from urllib2 import urlopen
response = urlopen(url, timeout=10)
html = response.read()

html[170:198]
print html[170:198]

But the 'á' character in it comes out as '\u0e41', which is the Thai character Sara Ae, as far as I understand.

Cadeia Acion\u0e41ria da Empresa 
Cadeia Acionแria da Empresa

The output of the print command should be:

Cadeia Acionária da Empresa

Can someone tell me what I am doing wrong?

related: [A good way to get the charset/encoding of an HTTP response in Python](http://stackoverflow.com/q/14592762/4279) — jfs, Jul 02 '15 at 16:06

score 0 · Answer 1 · answered Jul 01 '15 at 18:24

I found out what I was doing wrong. The web page was encoded as ISO-8859-1 and I wasn't decoding the downloaded bytes. Decoding them with the correct encoding after the download made everything work fine.
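To illustrate why the encoding matters, here is a minimal sketch of the same byte decoded two ways. The post does not say where the Thai character came from. One explanation consistent with the output, and only an assumption here, is that the byte 0xE1 was read under a Thai code page such as cp874, where it maps to U+0E41. The sample string is taken from the output in the question.

```python
# Raw bytes as a server might send them: 0xE1 is
# LATIN SMALL LETTER A WITH ACUTE in ISO-8859-1.
raw = b'Cadeia Acion\xe1ria da Empresa'

# Intended reading of the bytes.
as_latin1 = raw.decode('iso-8859-1')

# Assumption: a Thai code page was applied somewhere along the way.
# In cp874 the same byte 0xE1 is THAI CHARACTER SARA AE (U+0E41).
as_thai = raw.decode('cp874')

# Index 12 is the position of the accented letter in this sample.
print(repr(as_latin1[12]))
print(repr(as_thai[12]))
```

Only the codec differs between the two results; the bytes are identical. Applied to the downloaded page, the fix was: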

from urllib2 import urlopen
response = urlopen(url, timeout=10)
html = response.read()
html = unicode(html, 'ISO-8859-1')

html[170:198]
print html[170:198]

Now the string printed is:

Cadeia Acion\xe1ria da Empresa
Cadeia Acionária da Empresa

With the correct encoding.
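Hard-coding 'ISO-8859-1' works for this page, but other pages may declare a different charset. As a rough sketch, the charset can be read from the Content-Type header value instead. The helper names `charset_from_content_type` and `decode_body` are made up for this example, and the ISO-8859-1 fallback is an assumption, not something the server guarantees.

```python
from email.message import Message


def charset_from_content_type(content_type, fallback='iso-8859-1'):
    """Return the charset declared in a Content-Type value, or fallback.

    Hypothetical helper: the fallback encoding is an assumption.
    """
    msg = Message()
    if content_type:
        msg['Content-Type'] = content_type
    # get_content_charset() returns None when no charset is declared.
    return msg.get_content_charset() or fallback


def decode_body(raw, content_type, fallback='iso-8859-1'):
    """Decode raw response bytes using the declared charset."""
    return raw.decode(charset_from_content_type(content_type, fallback))


# Sample header value and body, standing in for a real response.
text = decode_body(b'Acion\xe1ria', 'text/html; charset=ISO-8859-1')
```

With urllib2, the header value should be available as `response.info().getheader('Content-Type')`, and the raw bytes come from `response.read()` as above. An unknown charset name would still raise a `LookupError`, which this sketch does not handle.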
