I'm getting data from a website, and this is an example of a sentence I retrieved: PHA+Q29ycmlnJmVhY3V0ZTtzIGV4ZXJjaWNlcyBlbnRyYWluZW1lbnQgY2hhcGl0cmUgbW91dmVtZW50IGV0IGZvcmNlczwvcD4K

The sentence is Base64-encoded, so I thought about decoding it and then decoding the resulting bytes as UTF-8 with Python:

import base64

sentence = "PHA+Q29ycmlnJmVhY3V0ZTtzIGV4ZXJjaWNlcyBlbnRyYWluZW1lbnQgY2hhcGl0cmUgbW91dmVtZW50IGV0IGZvcmNlczwvcD4K"
base64.b64decode(sentence).decode("utf-8")

The problem is that instead of looking like this: "Corrigés exercices entrainement chapitre mouvement et forces", it looks like this: "Corrig&eacute;s exercices entrainement chapitre mouvement et forces".

As you can see, the accents are completely messed up.

I'm using Python 3.

I do not have access to the decoded sentence using the API (I only have the base64 encoded one).

Thanks for your help.

  • You are correctly getting the plaintext from the Base64. However, that plaintext contains HTML entities; see [Decode HTML entities in Python string?](https://stackoverflow.com/questions/2087370/decode-html-entities-in-python-string). – Amadan Jan 21 '22 at 03:58

1 Answer

I'm posting this in case someone else doesn't know about HTML entities (I didn't) and needs the answer.

Thanks to Amadan's comment, I just learned that the strange thing I got instead of my accent was called an HTML entity.

To get my accent back, I needed to unescape the string:

import html

print(html.unescape("Corrig&eacute;s exercices entrainement chapitre mouvement et forces"))

>> Corrigés exercices entrainement chapitre mouvement et forces
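For completeness, here is a minimal sketch that chains both steps on the Base64 string from the question. The helper name `decode_sentence` is only an illustration (it is not part of any library), and it assumes the payload is UTF-8 text containing HTML entities, as it is here:

```python
import base64
import html


def decode_sentence(encoded: str) -> str:
    """Hypothetical helper: Base64 -> UTF-8 text -> HTML entities resolved."""
    # Step 1: Base64 gives back bytes; decode them as UTF-8 (assumed encoding).
    markup = base64.b64decode(encoded).decode("utf-8")
    # Step 2: replace entities such as &eacute; with the real characters.
    return html.unescape(markup)


sentence = "PHA+Q29ycmlnJmVhY3V0ZTtzIGV4ZXJjaWNlcyBlbnRyYWluZW1lbnQgY2hhcGl0cmUgbW91dmVtZW50IGV0IGZvcmNlczwvcD4K"
print(decode_sentence(sentence))
```

Note that this particular payload is wrapped in `<p>...</p>` and ends with a newline, so the printed result still contains those tags. `html.unescape` only handles entities; removing markup would need an HTML parser.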