I'm using urllib2 to open a Russian website and extract text from it. However, instead of "Беллона", the text comes out as "Áåëëîíà". What's the easiest way to fix this?
2 Answers
2
Figure out which encoding the webpage uses (for Russian pages, typically utf-8, windows-1251, or ISO 8859-5), and convert the byte string you read to unicode like this:
ustring = unicode(read_string, encoding=...)
If you need to determine the encoding of a webpage dynamically, see this SO answer.
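One common place to find the encoding is the charset parameter of the HTTP Content-Type header. The following is a minimal sketch of that idea, not the method from the linked answer: the helper name `charset_from_content_type` and the default of 'utf-8' are assumptions made for illustration, and the sample bytes are "Беллона" encoded as windows-1251.

```python
import re

def charset_from_content_type(header, default='utf-8'):
    """Return the charset named in a Content-Type header value, or default."""
    match = re.search(r'charset=["\']?([\w.:-]+)', header or '', re.IGNORECASE)
    return match.group(1).lower() if match else default

# The bytes a server would send for "Беллона" in windows-1251.
raw = b'\xc1\xe5\xeb\xeb\xee\xed\xe0'

# Hypothetical header value; a real one comes from the HTTP response.
encoding = charset_from_content_type('text/html; charset=windows-1251')
text = raw.decode(encoding)
```

With urllib2 on Python 2, the header value should be available as `response.info().getheader('Content-Type')`. Many pages omit the charset there and declare it only in a `<meta>` tag, so a fallback is still needed.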
Thanks! 'windows-1251' was the encoding that ended up working. – maxko87 Mar 11 '12 at 19:08
1
Try this:
import urllib2
doc = urllib2.urlopen('http://yandex.ru').read()  # raw bytes, not yet text
doc = doc.decode('utf-8')  # use the page's actual encoding here
That's all ;)
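The snippet above assumes the page is UTF-8. If that is not known in advance, one option is to try a short list of candidate encodings in order. This is only a sketch: the helper name `decode_with_fallback` and the candidate list are assumptions, and the sample bytes stand in for what `read()` would return from a windows-1251 page.

```python
def decode_with_fallback(raw, candidates=('utf-8', 'windows-1251')):
    """Decode raw bytes with the first candidate encoding that fits."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
    raise ValueError('none of the candidate encodings fit')

# "Беллона" as windows-1251 bytes.
raw = b'\xc1\xe5\xeb\xeb\xee\xed\xe0'

text, used = decode_with_fallback(raw)

# Reading the same bytes as Latin-1 reproduces the garbled text
# from the question.
garbled = raw.decode('latin-1')
```

UTF-8 goes first because it rejects most byte sequences that are not really UTF-8, while single-byte encodings such as windows-1251 accept nearly any input and would mask the error.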

Denis