I’m trying to parse a page and I’m having issues with special characters such as é, è, à, etc.
According to the Firefox page information tool, the page is encoded in UTF-8.
My code is the following:
import bs4
import requests
url = 'https://www.registreentreprises.gouv.qc.ca/RQEntrepriseGRExt/GR/GR99/GR99A2_05A_PIU_AfficherMessages_PC/ActiEcon.html'
page = requests.get(url)
cae_obj_soup = bs4.BeautifulSoup(page.text, 'lxml', from_encoding='utf-8')
list_all_domain = cae_obj_soup.find_all('th')
for element in list_all_domain:
    print(element.get_text())
The output is:
Pêche et piégeage
Exploitation forestière
I tried changing the encoding to iso-8859-1 (commonly used for French) and some other encodings, without success. I read several posts on parsing special characters, and they basically state that it’s a matter of selecting the right encoding. Is it possible that the special characters on some specific web pages simply can’t be decoded correctly, or am I doing something wrong?
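For what it’s worth, the garbled output matches the pattern you get when UTF-8 bytes are decoded as ISO-8859-1, and that can be reproduced without touching the network. Below is a minimal sketch under the assumption that this is what happens to page.text; the two helper names are hypothetical and only there for illustration.

```python
def simulate_wrong_decoding(text: str) -> str:
    """Encode text as UTF-8, then decode those bytes as ISO-8859-1.

    This mimics a client that receives UTF-8 bytes but assumes Latin-1.
    """
    return text.encode("utf-8").decode("iso-8859-1")


def undo_wrong_decoding(garbled: str) -> str:
    """Reverse the mistake: recover the original bytes, then decode as UTF-8."""
    return garbled.encode("iso-8859-1").decode("utf-8")


garbled = simulate_wrong_decoding("Pêche et piégeage")
print(garbled)                       # same shape as the output shown above
print(undo_wrong_decoding(garbled))  # round-trips back to the readable text
```

If that assumption holds, the text was probably already mis-decoded before BeautifulSoup saw it, so from_encoding would have no effect on it. Two things that might be worth trying are passing page.content (the raw bytes) instead of page.text, or setting page.encoding = 'utf-8' before reading page.text.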