I'm developing a web crawler to automatically download documents from a Brazilian website. The site uses an unknown encoding (no charset is defined in the head tag).
With very little effort, people can still read the documents. The real problem is that the page listing the documents uses links whose URLs contain accented characters. Without knowing the page's encoding, when I retrieve it with urllib2.urlopen, those characters come back garbled.
For example, Í characters come back as the Cyrillic capital letter E (Э).
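The symptom can be reproduced in isolation. The snippet below is only a sketch under an assumption I have not confirmed: that the page is really ISO-8859-1 and its bytes are being interpreted as ISO-8859-5 somewhere along the way. Under that assumption, the single byte for Í turns into exactly the Cyrillic letter I am seeing.

```python
# Assumption (not confirmed for the real site): the page is ISO-8859-1
# and something decodes its bytes as ISO-8859-5 instead.

# "Í" (U+00CD) is the single byte 0xCD in ISO-8859-1.
latin1_bytes = u"\u00cd".encode("iso-8859-1")

# The same byte read as ISO-8859-5 is U+042D, Cyrillic capital letter E.
misread = latin1_bytes.decode("iso-8859-5")

# Decoding with the assumed correct encoding gives the original character back.
correct = latin1_bytes.decode("iso-8859-1")
```

If that guess is right, the bytes coming from the server are intact and only the decoding step picks the wrong charset.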
I'm using BeautifulSoup, and prettify doesn't help, since urllib2 already returns the document with the bad characters.
One more thing: soup.originalEncoding returns None.
How can I make urllib2.urlopen either recognize the charset or accept an "expected encoding", so that it returns the characters as they are displayed in the browser?
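To make "expected encoding" concrete, here is a minimal sketch of the behavior I am after. The helper name decode_page, its parameters, the ISO-8859-1 default, and the sample link are all hypothetical, made up for illustration. It works on raw bytes such as those returned by reading the response, and it is not something urllib2 provides.

```python
def decode_page(raw_bytes, declared_charset=None, expected_encoding="iso-8859-1"):
    """Decode raw page bytes to unicode.

    Uses the charset declared by the server or page when there is one,
    otherwise falls back to the encoding the caller expects.
    Hypothetical helper; the ISO-8859-1 default is an assumption.
    """
    encoding = declared_charset or expected_encoding
    return raw_bytes.decode(encoding)


# Made-up sample: a link with an accented character, encoded as ISO-8859-1.
raw = b'<a href="/docs/\xcdndice.pdf">\xcdndice</a>'

# No charset was declared, so the expected encoding is used.
html = decode_page(raw)
```

The resulting unicode string could then be handed to BeautifulSoup in place of the raw bytes, so the parser would not have to guess the encoding.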