
I am trying to download a webpage using urllib2 in Python.

from urllib2 import urlopen

response = urlopen(url, timeout=10)
html = response.read()

html[170:198]
print html[170:198]

But the 'á' character in it comes out as '\u0e41', which, as far as I understand, is the Thai character Sara Ae.

Cadeia Acion\u0e41ria da Empresa 
Cadeia Acionแria da Empresa

The output of the print command should be:

Cadeia Acionária da Empresa 

Can someone tell me what I am doing wrong?

t.pimentel
  • related: [A good way to get the charset/encoding of an HTTP response in Python](http://stackoverflow.com/q/14592762/4279) – jfs Jul 02 '15 at 16:06

1 Answer


I found out what I was doing wrong. The webpage was encoded as ISO-8859-1 and I wasn't decoding it after downloading it. Decoding the downloaded bytes with the correct encoding made everything work fine.

response = urlopen(url, timeout=10)
html = response.read()
html = unicode(html, 'ISO-8859-1')  # decode the raw bytes into a unicode string

html[170:198]
print html[170:198]

Now the string printed is:

Cadeia Acion\xe1ria da Empresa
Cadeia Acionária da Empresa 

The text is now displayed with the correct encoding.
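As a rough sketch of the same idea, the encoding can be taken from the Content-Type header instead of being hard-coded. The helper names `charset_from_content_type` and `decode_body`, the sample bytes, and the ISO-8859-1 fallback are assumptions made for illustration, not part of the original code. The sketch works on a byte string and a header value, so it does not need a network connection:

```python
# -*- coding: utf-8 -*-
from email.message import Message


def charset_from_content_type(content_type, default='ISO-8859-1'):
    """Return the charset named in a Content-Type value, or the default."""
    msg = Message()
    msg['Content-Type'] = content_type
    # get_content_charset() returns a lowercased name, or None if absent.
    return msg.get_content_charset() or default


def decode_body(raw_bytes, content_type):
    """Decode downloaded bytes using the charset the server declared."""
    return raw_bytes.decode(charset_from_content_type(content_type))


# Hypothetical stand-ins for response.read() and the Content-Type header.
raw = b'Cadeia Acion\xe1ria da Empresa'
text = decode_body(raw, 'text/html; charset=ISO-8859-1')

# For comparison: the same byte read with a Thai codec (ISO-8859-11)
# becomes U+0E41, the character seen in the question.
thai = raw.decode('iso8859_11')
```

With urllib2 the header value should be available as `response.info().getheader('Content-Type')`. Pages that declare their charset only in a `<meta>` tag would still need separate handling.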

t.pimentel