Issue with parsing special characters in a utf-8 encoded page with bs4

Question

I’m trying to parse a page and I’m having issues with special characters such as é, è, à, etc.

According to the Firefox page information tool, the page is encoded in UTF-8.

My code is the following:

import bs4
import requests


url = 'https://www.registreentreprises.gouv.qc.ca/RQEntrepriseGRExt/GR/GR99/GR99A2_05A_PIU_AfficherMessages_PC/ActiEcon.html'

page = requests.get(url)

cae_obj_soup = bs4.BeautifulSoup(page.text, 'lxml', from_encoding='utf-8')
list_all_domain = cae_obj_soup.find_all('th')

for element in list_all_domain:
    print(element.get_text())

The output is:

PÃªche et piÃ©geage
Exploitation forestiÃ¨re

I tried changing the encoding to iso-8859-1 (a common encoding for French text) and some other encodings, without success. I read several posts on parsing special characters, and they basically state that it’s a matter of selecting the right encoding. Is it possible that the special characters on some specific webpages can’t be decoded correctly, or am I doing something wrong?

score 1 · Accepted Answer · answered Sep 14 '20 at 20:14

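The requests library takes a strict approach to decoding web pages, so page.text may already have been decoded with the wrong charset by the time BeautifulSoup sees it. Passing from_encoding does not help in that case, because it is ignored when the markup is already a str. BeautifulSoup, on the other hand, has powerful tools for determining the encoding of a document. So it's better to pass the raw bytes of the response (r.content rather than r.text) to BeautifulSoup and let it try to determine the encoding.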

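>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('https://www.registreentreprises.gouv.qc.ca/RQEntrepriseGRExt/GR/GR99/GR99A2_05A_PIU_AfficherMessages_PC/ActiEcon.html')
>>> # Pass the undecoded bytes (r.content), not the already-decoded str (r.text).
>>> soup = BeautifulSoup(r.content, 'lxml')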
>>> list_all_domain = soup.find_all('th')
>>> [e.get_text() for e in list_all_domain]
['Agriculture', "Services relatifs à l'agriculture", 'Pêche et piégeage', ...]
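
To see where the garbled output in the question likely comes from, here is a minimal offline sketch. It assumes (this was not verified against the actual server) that the response carries no charset in its Content-Type header, in which case requests falls back to ISO-8859-1 when building the text attribute. The sample string is taken from the question's output; the variable names are only illustrative.

```python
# The text as it should appear, and the UTF-8 bytes a server would send for it.
expected = "Pêche et piégeage"
raw = expected.encode("utf-8")

# Decoding UTF-8 bytes as ISO-8859-1 turns each two-byte character
# into two separate Latin-1 characters (mojibake).
garbled = raw.decode("iso-8859-1")
print(garbled)  # PÃªche et piÃ©geage

# Decoding the same bytes with the correct codec gives the intended text.
print(raw.decode("utf-8"))  # Pêche et piégeage

# An already-garbled string can be repaired by reversing the wrong decode,
# provided no bytes were lost along the way.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired)  # Pêche et piégeage
```

This is why working from the raw bytes matters: once the bytes have been decoded with the wrong codec, telling the parser the right encoding afterwards has nothing left to act on.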
