You are trying to decode data without specifying a codec. The default is used in that case (UTF-8), and that default is wrong for this page. Given the domain name, I'd expect it to be a Cyrillic encoding instead.
If the response includes the right codec, it can be found with url.info().get_charset(); this returns None if no charset was set, at which point the HTML may contain a hint in a <meta> tag instead, which you'll have to parse manually.
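As a rough sketch of that manual parsing (the helper name and the regex are my own, not a standard API; a real HTML parser is more robust), you could scan the start of the document for a charset declaration:

```python
import re

def charset_from_meta(html_bytes):
    """Look for a charset hint in a <meta> tag; a rough sketch, not a full parser."""
    # Decode leniently first; Latin-1 never fails, and the <meta> tag is ASCII anyway.
    # Charset declarations are expected near the top, so a 2 KiB prefix is enough here.
    head = html_bytes[:2048].decode('latin1')
    # Matches both <meta charset="..."> and the older
    # <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
    match = re.search(r'<meta[^>]+charset=["\']?([\w-]+)', head, re.IGNORECASE)
    return match.group(1) if match else None

print(charset_from_meta(b'<meta charset="windows-1251">'))  # windows-1251
```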
The URL you are trying to load does not include a character set in the content type:
>>> import urllib.request
>>> url = urllib.request.urlopen('http://google.ru')
>>> url.info().get_charset() is None
True
If neither a <meta> tag nor a Content-Type character set has been set, the default is Latin-1; this works for the URL you provided:
print(url.read().decode('latin1'))
However, this is probably not the correct encoding either; Latin-1 decodes any byte sequence without error, so it "works" for all content. You'll likely get mojibake instead. In some cases you may need to hardcode the codec; this page looks like CP-1251 (the Windows Cyrillic codepage) to me.
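To see what that mojibake looks like, here is a small round-trip with a sample Cyrillic word (my own example text, not taken from the page):

```python
text = 'Поиск'                  # Cyrillic for "Search"
raw = text.encode('cp1251')     # the bytes as a CP-1251 server would send them
print(raw.decode('latin1'))     # wrong codec, no error, but mojibake: Ïîèñê
print(raw.decode('cp1251'))     # right codec recovers the text: Поиск
```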
If you are planning to parse the HTML, use BeautifulSoup and pass in the bytes
content; it'll auto-detect the encoding for you:
import urllib.request
from bs4 import BeautifulSoup
with urllib.request.urlopen('http://google.ru') as url:
    soup = BeautifulSoup(url)
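You can check which encoding BeautifulSoup settled on via its original_encoding attribute. A small offline sketch with hand-made bytes (the sample markup is my own, no network needed):

```python
from bs4 import BeautifulSoup

# Bytes with an embedded charset declaration, as a server might send them.
data = '<meta charset="windows-1251"><p>Поиск</p>'.encode('cp1251')
soup = BeautifulSoup(data, 'html.parser')
print(soup.original_encoding)  # the encoding BeautifulSoup detected
print(soup.p.string)           # text decoded correctly
```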
You can tell BeautifulSoup to use a specific encoding with from_encoding if it gets the auto-detection wrong:
with urllib.request.urlopen('http://google.ru') as url:
    soup = BeautifulSoup(url, from_encoding='cp1251')
Demo:
>>> import urllib.request
>>> from bs4 import BeautifulSoup
>>> url = urllib.request.urlopen('http://google.ru')
>>> soup = BeautifulSoup(url, from_encoding='cp1251')
>>> soup.head.meta
<meta content="Поиск информации в интернете: веб страницы, картинки, видео и многое другое." name="description"/>
I must say that I am surprised Google didn't set a proper content-type character set on the response here.