UnicodeDecodeError with urllib.request object

Question

When I run code like this:

import urllib.request

with urllib.request.urlopen('http://google.ru') as url:
    print(url.read().decode())

I get this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcf in position 102: invalid continuation byte

How can I fix it?

Note that this has *nothing to do with Sublime Text 3*. ST3 runs Python 3, and this issue is the same in any Python 3 interpreter session. — Martijn Pieters, Sep 10 '14 at 07:37

score 4 · Accepted Answer · edited May 23 '17 at 12:29

You are trying to decode data without specifying a codec. The default is used in that case (UTF-8), and that default is wrong for this page. Given the domain name, I'd expect it to be a Cyrillic encoding instead.

If the response declares a codec, you can find it with url.info().get_charset(), which returns None when none was set. In that case the HTML may still contain a hint in a <meta> tag, which you'll have to parse manually.
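As a rough illustration of what parsing the <meta> hint manually could look like, here is a minimal sketch that searches the start of the raw, still undecoded bytes for a charset declaration. The helper name sniff_meta_charset, the regular expression and the 2048-byte limit are my own assumptions, not something urllib provides, and a regex like this will miss unusual markup; a real HTML parser is more robust.

```python
import re

# Hypothetical helper, not part of urllib: match a charset declaration inside
# a <meta> tag. This covers both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
META_CHARSET = re.compile(
    rb'''<meta[^>]+charset\s*=\s*["']?([\w.:-]+)''',
    re.IGNORECASE,
)


def sniff_meta_charset(raw, limit=2048):
    """Return the lowercased charset named in a <meta> tag, or None."""
    # Only the head of the document is inspected; the limit is an assumption.
    match = META_CHARSET.search(raw[:limit])
    if match is None:
        return None
    return match.group(1).decode('ascii').lower()
```

You would call this on the result of url.read() before decoding, and only fall back to a default when it returns None.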

The URL you are trying to load does not include a character set in the content type:

>>> import urllib.request
>>> url = urllib.request.urlopen('http://google.ru')
>>> url.info().get_charset() is None
True
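To show how the header check could feed into the decode step, here is a hedged sketch. Note that it swaps in get_content_charset(), the email.message.Message method documented as returning the charset parameter of the Content-Type header, in place of the get_charset() call used above. The function names declared_charset, decode_body and fetch_text, and the 'latin-1' fallback, are assumptions for illustration only.

```python
import urllib.request
from email.message import Message


def declared_charset(headers: Message):
    """Return the charset parameter of the Content-Type header, or None."""
    # get_content_charset() lowercases the value and returns None
    # (its default failobj) when no charset parameter is present.
    return headers.get_content_charset()


def decode_body(raw: bytes, headers: Message, fallback: str = 'latin-1') -> str:
    """Decode raw bytes with the declared charset, else with the fallback."""
    return raw.decode(declared_charset(headers) or fallback)


def fetch_text(address: str) -> str:
    """Fetch a URL and decode it according to its response headers."""
    with urllib.request.urlopen(address) as response:
        # response.headers is the same object that response.info() returns.
        return decode_body(response.read(), response.headers)
```

With this in place, fetch_text('http://google.ru') would use whatever charset the server declares and only fall back to Latin-1 when the header is silent.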

If neither a <meta> tag nor a Content-Type character set has been set, the default is Latin-1; this works for the URL you provided:

print(url.read().decode('latin1'))

However, Latin-1 is probably not the correct encoding either: it maps every possible byte to a character, so decoding never fails, whatever the real encoding was. You'll likely get Mojibake instead. In some cases you may need to hardcode the encoding; this looks like CP-1251 (the Windows Cyrillic codepage) to me.
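The difference between the three codecs can be reproduced offline. This is a small sketch under the assumption that the page really is CP-1251; the sample word is taken from the description text that Google's page serves. In CP-1251 the letter П is the single byte 0xcf, which UTF-8 treats as the start of a two-byte sequence, so a following byte that is not a continuation byte triggers the error from the question.

```python
text = 'Поиск'
raw = text.encode('cp1251')  # the bytes a CP-1251 server would send

# UTF-8 rejects these bytes outright.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    print('UTF-8 failed:', exc.reason)

# Latin-1 never fails, but the result is Mojibake rather than Cyrillic.
garbled = raw.decode('latin-1')

# CP-1251 recovers the original text.
correct = raw.decode('cp1251')

print(repr(garbled))
print(repr(correct))
```

Because Latin-1 decoding is lossless, text that was decoded this way by mistake can be repaired with garbled.encode('latin-1').decode('cp1251').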

If you are planning to parse the HTML, use BeautifulSoup and pass in the undecoded bytes (or the response object itself, as here); it'll auto-detect the encoding for you:

import urllib.request
from bs4 import BeautifulSoup

with urllib.request.urlopen('http://google.ru') as url:
    # Name the parser explicitly; newer bs4 releases warn when it is omitted.
    soup = BeautifulSoup(url, 'html.parser')

You can tell BeautifulSoup to use a specific encoding with from_encoding if it gets the auto-detection wrong:

with urllib.request.urlopen('http://google.ru') as url:
    # from_encoding overrides auto-detection; the parser is named explicitly.
    soup = BeautifulSoup(url, 'html.parser', from_encoding='cp1251')

Demo:

>>> import urllib.request
>>> from bs4 import BeautifulSoup
>>> url = urllib.request.urlopen('http://google.ru')
>>> soup = BeautifulSoup(url, from_encoding='cp1251')
>>> soup.head.meta
<meta content="Поиск информации в интернете: веб страницы, картинки, видео и многое другое." name="description"/>

I must say that I am surprised Google didn't set a proper content-type character set on the response here.
