I'm using urllib2 to open a Russian website and extract text from it. However, instead of coming out as "Беллона", the text comes out as "Áåëëîíà". What's the easiest way to fix this?

maxko87

2 Answers

Figure out which encoding the webpage uses (for Russian pages, probably utf-8, windows-1251, or ISO 8859-5), and convert the bytes you read to unicode like this:

ustring = unicode(read_string, encoding=...)
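To see why the text comes out garbled, here is a small sketch. It assumes the page is served as windows-1251 and that its bytes were interpreted as Latin-1; that assumption reproduces the output shown in the question, but the question itself does not confirm the page's encoding. The variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
original = u"Беллона"

# Encode as windows-1251 (cp1251), a common legacy encoding for Russian pages.
raw_bytes = original.encode("cp1251")

# Misreading those bytes as Latin-1 yields the garbled text.
garbled = raw_bytes.decode("latin-1")

# Decoding with the matching codec recovers the text.
recovered = raw_bytes.decode("cp1251")

# If only the garbled string is left, rebuild the bytes and decode them again.
repaired = garbled.encode("latin-1").decode("cp1251")
```

The last step only works when the wrong decoding was lossless, as it is with Latin-1, which maps every byte to a character.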

If you need to determine the encoding of a webpage dynamically, see this SO answer.
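One possible approach to dynamic detection, sketched here under assumptions, is to read the charset from the HTTP Content-Type header and fall back to a default when none is declared. This sketch is written for Python 3, where urllib.request replaces urllib2; the function names charset_from_content_type, decode_body, and fetch_text are hypothetical helpers, not part of any library, and this is not necessarily the method the linked answer describes.

```python
from email.message import Message
from urllib.request import urlopen  # Python 3 successor to urllib2.urlopen


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type header value, or `default`."""
    msg = Message()
    msg["Content-Type"] = content_type or ""
    return msg.get_content_charset() or default


def decode_body(raw, content_type, default="utf-8"):
    """Decode raw bytes using the declared charset, falling back to `default`."""
    charset = charset_from_content_type(content_type, default)
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or a wrong declaration: fall back and keep going.
        return raw.decode(default, errors="replace")


def fetch_text(url):
    """Download `url` and decode it according to its Content-Type header."""
    with urlopen(url) as response:
        raw = response.read()
        content_type = response.headers.get("Content-Type", "")
    return decode_body(raw, content_type)
```

A call such as fetch_text('http://yandex.ru') would then return text rather than bytes. Pages that declare their encoding only in an HTML meta tag are not handled by this sketch.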

alexis

Try this:

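import urllib2
doc = urllib2.urlopen('http://yandex.ru').read()  # raw bytes from the response
doc = doc.decode('utf-8')  # assumes the page is UTF-8; use its actual encoding otherwise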

That's all ;)

Denis