urllib encoding issues

Question

I'm developing a web crawler to automatically download some documents from a Brazilian website. The site uses an unknown encoding (no charset is declared in the head tag).

With very little effort, people can still read the documents. The real problem is that the page listing the documents uses links whose URLs contain accented characters. Because I don't know the page's encoding, the characters are all garbled when I retrieve the page with urllib2.urlopen.

For example, Í characters come out as the Cyrillic capital letter E.

I'm using BeautifulSoup, and prettify doesn't help, since urllib2 already returns the document with the bad characters.

And one more thing: soup.originalEncoding returns None.

How can I make urllib2.urlopen either recognize the charset or accept an "expected encoding", so that it returns the characters as they are displayed in the browser?

How many Brazilian encodings can there be? cp860? http://docs.python.org/library/codecs.html?highlight=codecs#standard-encodings – monkut, Aug 16 '12 at 13:27

score 2 · Accepted Answer · edited May 23 '17 at 12:24

The character set can be retrieved from the HTTP response header. I would give you the code I use, but it is derived from "How to download any(!) webpage with correct charset in python?", and its author does a much better job of explaining the process, so I will just point you there.
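For readers who want something concrete, here is a minimal sketch of reading the charset from the Content-Type header and decoding the body with it. It is not the code from the linked question. It assumes Python 3 (urllib.request rather than urllib2), the helper names are hypothetical, and the ISO-8859-1 fallback is only a guess for when the server names no charset.

```python
import codecs
from email.message import Message
from urllib.request import urlopen

# Assumption: a fallback guess for this kind of site, not taken from the thread.
FALLBACK = "iso-8859-1"


def charset_from_content_type(content_type, default=FALLBACK):
    """Return the charset named in a Content-Type header value, or `default`."""
    if not content_type:
        return default
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # lower-cased name, or None
    if charset is None:
        return default
    try:
        codecs.lookup(charset)  # reject names Python has no codec for
    except LookupError:
        return default
    return charset


def decode_body(raw, content_type, default=FALLBACK):
    """Decode raw response bytes using the charset from the header value."""
    charset = charset_from_content_type(content_type, default)
    return raw.decode(charset, errors="replace")


def fetch_text(url):
    """Download `url` and return its body as text (hypothetical helper)."""
    with urlopen(url) as resp:
        return decode_body(resp.read(), resp.headers.get("Content-Type"))
```

The byte 0xCD is Í in ISO-8859-1 and the Cyrillic capital letter E in ISO-8859-5, which would be consistent with the symptom in the question if the page is Latin-1 and is being read with a Cyrillic codec. When handing raw bytes to BeautifulSoup 4 instead, the same charset name can be passed as from_encoding.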

answered Aug 16 '12 at 13:26

BigHandsome


The solution in the linked question really worked. I had tried passing the fromEncoding parameter to the BeautifulSoup constructor before, but it didn't work. Then I noticed that bs4 renamed it to from_encoding, a notation more common in Python, and it worked just fine. Additionally, I used the following solution to properly convert the URL to ASCII: http://stackoverflow.com/questions/804336/best-way-to-convert-a-unicode-url-to-ascii-utf-8-percent-escaped-in-python Thanks a lot! – Ken Aug 16 '12 at 14:58
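As an illustration of the URL conversion mentioned in the comment above, here is a rough sketch of percent-escaping the non-ASCII parts of a URL. It is not the code from the linked answer. It assumes Python 3, an ASCII host name, and a hypothetical helper name; which encoding the server expects for the escaped bytes is also an assumption (UTF-8 by default here, though a Latin-1 site may want iso-8859-1).

```python
from urllib.parse import quote, urlsplit, urlunsplit


def percent_escape_url(url, encoding="utf-8"):
    """Percent-escape non-ASCII characters in the path and query of `url`.

    Assumes the host is already ASCII. Existing %XX escapes are left
    untouched because '%' is listed as a safe character.
    """
    parts = urlsplit(url)
    path = quote(parts.path, safe="/%", encoding=encoding)
    query = quote(parts.query, safe="=&%+", encoding=encoding)
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))
```

If the escaped link returns a 404, trying the other encoding is a reasonable next step, since the same character escapes to different bytes in each.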
