Russian symbols in Python output corrupted (ENCODING)

Question

I parsed an HTML document that contains Russian text. When I try to print it in Python, I get this:

ÐÐ»ÑÐ±Ð½Ð¸ÑÐ½ÑÐ¹ Ð½Ð¾Ð²Ð¾Ð³Ð¾Ð´Ð½Ð¸Ð¹ Ð¿ÑÐ½Ñ

I checked the encoding and got ISO-8859-1, so I'm trying to decode it like this:

print drink_name.decode('iso8859-1')

But I get an error. How can I print this text, or encode it in Unicode?

Please include the code that you use to parse the HTML document in the first place, so we can help you not make this mistake in the first place. — Martijn Pieters, Nov 11 '14 at 16:46
The answer can be [here](https://stackoverflow.com/questions/9233027/unicodedecodeerror-charmap-codec-cant-decode-byte-x-in-position-y-character). It helped me. — Rashevskii, Mar 22 '22 at 07:43

Martijn Pieters · Answer 1 · 2014-11-11T16:54:08.600

You have a Mojibake: UTF-8 bytes that were decoded as Latin-1 (or CP1252) in this case.

You can repair it by reversing the process:

>>> print u'ÐÐ»ÑÐ±Ð½Ð¸ÑÐ½ÑÐ¹ Ð½Ð¾Ð²Ð¾Ð³Ð¾Ð´Ð½Ð¸Ð¹ Ð¿ÑÐ½Ñ'.encode('latin1').decode('utf8')
Клубничный новогодний пунш

(I had to copy the string from the original post source to capture all the non-printable bytes in the Mojibake).
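To see why the round trip works, here is a minimal sketch (written for Python 3, where `str` is already Unicode) that creates the same kind of Mojibake from the known-good text and then reverses it. The helper name `repair_mojibake` is made up for illustration, not a library function, and the Latin-1 default is an assumption about which wrong codec was applied.

```python
def repair_mojibake(text, wrong="latin-1", right="utf-8"):
    """Undo text whose `right`-encoded bytes were decoded as `wrong`.

    Hypothetical helper: returns the input unchanged if the round trip
    is not possible, which usually means the text was not garbled this way.
    """
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


original = "Клубничный новогодний пунш"

# Reproduce the mistake: UTF-8 bytes interpreted as Latin-1.
raw = original.encode("utf-8")
garbled = raw.decode("latin-1")

# Reverse it: back to the same bytes, then decode with the right codec.
repaired = repair_mojibake(garbled)
```

The fallback matters because text that was never garbled (real Cyrillic, for instance) cannot be encoded to Latin-1, so the helper leaves it alone instead of raising.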

The better approach is to not decode incorrectly in the first place. You decoded the original text with the wrong encoding; use UTF-8 as the codec instead.

If you used requests to download the page, do not use response.text in this case; if the server fails to specify a codec, the HTTP RFC default is Latin-1, but HTML documents often embed the encoding in a <meta> header instead. In such cases, leave decoding to your parser, such as BeautifulSoup:

import requests
from bs4 import BeautifulSoup
response = requests.get(url)
soup = BeautifulSoup(response.content)  # pass in undecoded bytes
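If you want to check whether the server actually declared a charset before trusting any decoded text, the following is a minimal standard-library sketch. The helper names `declared_charset` and `decode_or_keep_bytes` and the sample header values are assumptions for illustration, not part of requests or BeautifulSoup; with requests, the header string would typically come from `response.headers`.

```python
from email.message import Message


def declared_charset(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def decode_or_keep_bytes(body, content_type):
    """Decode only when the server named a charset.

    Otherwise return the bytes untouched, so an HTML parser can look
    for a <meta> declaration itself instead of us guessing Latin-1.
    """
    charset = declared_charset(content_type)
    if charset is None:
        return body
    return body.decode(charset)


page = "<p>Клубничный новогодний пунш</p>".encode("utf-8")

with_charset = decode_or_keep_bytes(page, "text/html; charset=UTF-8")
without_charset = decode_or_keep_bytes(page, "text/html")
```

In the second call nothing is decoded, which is exactly the situation where passing the raw bytes to the parser is the safer choice.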

yes, this works. Thanks. I'll accept the answer in 10 minutes. — aaaapppp, Nov 11 '14 at 16:48
print drink_name.encode('latin1') even this works good. Also, how do you know that this is a Mojibake? — aaaapppp, Nov 11 '14 at 16:49
@aaaapppp: that only works if your terminal is configured to UTF-8. You cannot rely on that always. — Martijn Pieters, Nov 11 '14 at 16:49
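
As a rough illustration of the terminal point above, this Python 3 sketch checks whether a given output stream's encoding can represent the text at all. The function name `stream_can_show` is hypothetical, and falling back to ASCII when a stream reports no encoding is an assumption made for the example.

```python
import io
import sys


def stream_can_show(text, stream=None):
    """Return True if `text` is encodable in the stream's declared encoding."""
    stream = sys.stdout if stream is None else stream
    # Assumption: treat a missing or empty encoding as plain ASCII.
    encoding = getattr(stream, "encoding", None) or "ascii"
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Two in-memory stand-ins for terminals with different configurations.
utf8_terminal = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")
latin1_terminal = io.TextIOWrapper(io.BytesIO(), encoding="latin-1")

fits_utf8 = stream_can_show("пунш", utf8_terminal)
fits_latin1 = stream_can_show("пунш", latin1_terminal)
```

Printing properly decoded Unicode text lets Python handle the output encoding, whereas printing hand-encoded bytes only looks right when the terminal happens to expect those exact bytes.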
