Python Unicode error: why do I keep getting these characters although I used encode("utf-8")?

Question

for p in articles2:
    url = p.find('a')['href']
    title = p.find('h3').get_text().strip().encode("utf-8")
    print(title)

OUTPUT:

c3\xa9gie de d\xc3\xa9fense active et pr\xc3\xa9ventive\xc2\xbb'

b'Zoom sur la course effr\xc3\xa9n\xc3\xa9e pour trouver un vaccin'

b'On vous le dit'

b'\xc3\x89dition du jour (PDF)'

b'Son port est d\xc3\xa9sormais obligatoire : Le prix du masque plafonn\xc3\xa9'

b'Baisse de 20% des prix des produits agricoles' .....

Kindly share sample input data and expected output in a copy-pastable format. — Anshul, May 23 '20 at 17:21
What do you want to accomplish? The output is UTF-8-encoded and is a `bytes` object. If you want to output strings, don't encode. — wastl, May 23 '20 at 17:28
Those are UTF-8 encoded byte strings, which is the normal output of `.encode('utf-8')`. If I do `b'Zoom sur la course effr\xc3\xa9n\xc3\xa9e pour trouver un vaccin'.decode('utf-8')` I get `'Zoom sur la course effrénée pour trouver un vaccin'`. Encoding to a byte string is good for saving to a file or sending over the network, but it's not good for human viewing. — tdelaney, May 23 '20 at 17:46
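
The point made in the comments above can be checked with a minimal sketch. The sample title is copied from the output in the question; the variable names are illustrative and not part of the original code. It shows that `encode` turns a `str` into `bytes`, that `decode("utf-8")` recovers the text, and that decoding the same bytes as Latin-1 produces the garbled form `Ã©`.

```python
# Sample title taken from the question's output; variable names are illustrative.
title = "Zoom sur la course effrénée pour trouver un vaccin"

# str -> bytes: printing this shows the b'...\xc3\xa9...' form from the question.
encoded = title.encode("utf-8")
print(type(encoded).__name__)
print(encoded)

# bytes -> str: decoding with the same codec recovers the readable text.
decoded = encoded.decode("utf-8")
print(decoded)

# Decoding UTF-8 bytes with the wrong codec (Latin-1) yields mojibake such as "Ã©".
mojibake = encoded.decode("latin-1")
print(mojibake)
```

In the question's loop, this suggests simply dropping the `.encode("utf-8")` call before `print(title)`, as the comments recommend.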

score 0 · Answer 1 · answered May 23 '20 at 17:24

Try a different encoding; it seems these characters are Latin-1.

You can find more encodings here

Petru Tanas

xaander1 · Answer 2 · 2020-05-25T20:51:48.837

Use split() and join() to translate the characters.

For example, "Zoom sur la course effr\xc3\xa9n\xc3\xa9e pour trouver un vaccin" becomes 'Zoom sur la course effrÃ©nÃ©e pour trouver un vaccin' after split() and join().

Then encode it to ASCII with the 'ignore' error handler and decode it as UTF-8, in order to remove special characters such as Ã©.

It should look like this:

"".join(the_text_to_clean.strip()).encode('ascii', 'ignore').decode("utf-8")

How it applies in your code:

for p in articles2:
    url = p.find('a')['href']
    title = p.find('h3').get_text()
    title = "".join(title.strip()).encode('ascii', 'ignore').decode("utf-8")  # clean title
    print(title)
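
As a quick check of what this approach does, here is a small sketch applied to one title from the question (the variable names are illustrative). It suggests that `encode('ascii', 'ignore')` drops every non-ASCII character rather than converting it, so accented letters disappear from the result, and that `"".join(...)` over a string returns the same string.

```python
# Sample title from the question; variable names are illustrative.
title = "Zoom sur la course effrénée pour trouver un vaccin"

# "".join over a str iterates its characters and rebuilds the same string.
joined = "".join(title.strip())
print(joined == title)

# The 'ignore' handler silently discards characters that ASCII cannot represent.
cleaned = joined.encode("ascii", "ignore").decode("utf-8")
print(cleaned)  # Zoom sur la course effrne pour trouver un vaccin
```

If the accents should be kept, printing the `str` returned by `get_text().strip()` without encoding it avoids the loss.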

Please edit your answer to explain how this answers the question. — Dragonthoughts, May 25 '20 at 20:16
