Python problem with accents when decoding from base64

Question

I'm getting data from a website and this is an example of a sentence I retrieved : PHA+Q29ycmlnJmVhY3V0ZTtzIGV4ZXJjaWNlcyBlbnRyYWluZW1lbnQgY2hhcGl0cmUgbW91dmVtZW50IGV0IGZvcmNlczwvcD4K

The sentence is Base64-encoded, so I tried decoding it and then decoding the resulting bytes as UTF-8 with Python:

import base64

sentence = "PHA+Q29ycmlnJmVhY3V0ZTtzIGV4ZXJjaWNlcyBlbnRyYWluZW1lbnQgY2hhcGl0cmUgbW91dmVtZW50IGV0IGZvcmNlczwvcD4K"
base64.b64decode(sentence).decode("utf-8")

The problem is that instead of looking like this: "Corrigés exercices entrainement chapitre mouvement et forces", it looks like this: "Corrig&eacute;s exercices entrainement chapitre mouvement et forces".

As you can see, the accents are completely messed up.

I'm using Python 3.

I do not have access to the decoded sentence using the API (I only have the base64 encoded one).

Thanks for your help.

You are correctly getting the plaintext from the Base64. However, that plaintext contains HTML entities; see [Decode HTML entities in Python string?](https://stackoverflow.com/questions/2087370/decode-html-entities-in-python-string). — Amadan, Jan 21 '22 at 03:58

score 0 · Answer 1 · answered Jan 21 '22 at 04:08

I'm posting this in case someone else doesn't know about HTML entities (just like me) and needs the answer.

Thanks to Amadan's comment, I just learned that the strange thing I got instead of my accent was called an HTML entity.

To get my accent back, I needed to unescape it:

import html

print(html.unescape("Corrig&eacute;s exercices entrainement chapitre mouvement et forces"))

>> Corrigés exercices entrainement chapitre mouvement et forces
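For completeness, here is a minimal sketch that chains both steps on the Base64 string from the question. The helper name `decode_base64_html` is made up for illustration, and UTF-8 is assumed as the byte encoding (as in the question). Note that the decoded text is wrapped in `<p>` tags, which `html.unescape` leaves untouched, since it only resolves entities.

```python
import base64
import html


def decode_base64_html(encoded: str) -> str:
    """Hypothetical helper: Base64 -> text, then resolve HTML entities."""
    # Step 1: Base64 -> bytes -> str (UTF-8 is an assumption, as in the question)
    text = base64.b64decode(encoded).decode("utf-8")
    # Step 2: replace entities such as &eacute; with the characters they name
    return html.unescape(text)


sentence = "PHA+Q29ycmlnJmVhY3V0ZTtzIGV4ZXJjaWNlcyBlbnRyYWluZW1lbnQgY2hhcGl0cmUgbW91dmVtZW50IGV0IGZvcmNlczwvcD4K"
result = decode_base64_html(sentence)
print(result)
```

This should print `<p>Corrigés exercices entrainement chapitre mouvement et forces</p>`. If the tags are unwanted, they have to be removed in a separate step.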
