Getting non-English text from an HTML doc

Question

I'm trying to get the title of an HTML document in Python, but I'm getting weird symbols. I guess that's because of the encoding, but the HTML document is UTF-8 encoded. Is there any way I can get the normal letters?

Here is the code and what I am getting:

from bs4 import BeautifulSoup

with open("index.html") as file:
    src = file.read()

soup = BeautifulSoup(src, "lxml")

title = soup.title.text

print(title)

Р“Р»Р°РІРЅР°СЏ СЃС‚СЂР°РЅРёС†Р°

Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. — Community, Jul 19 '22 at 07:42
https://stackoverflow.com/questions/491921/unicode-utf-8-reading-and-writing-to-files-in-python — gre_gor, Jul 19 '22 at 07:45

score -1 · Accepted Answer · answered Jul 19 '22 at 07:47

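You need to specify the encoding when opening the file. Without the encoding argument, open() falls back to the platform's locale encoding (judging by the output, probably Windows-1251 here), so the UTF-8 bytes get decoded with the wrong code page: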

with open("index.html", encoding='utf-8') as file:
    src = file.read()
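
To see why this helps, here is a minimal sketch that reproduces the problem using only the standard library. The sample page and its title are assumptions made up for illustration (the title is what the garbled output in the question appears to decode to), and cp1251 is passed explicitly to stand in for a platform default that may differ on your machine.

```python
import tempfile
from pathlib import Path

# Hypothetical sample page, used only for this illustration.
html = "<html><head><title>Главная страница</title></head></html>"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "index.html"
    path.write_bytes(html.encode("utf-8"))  # the file on disk is UTF-8

    # Stand-in for a platform whose default encoding is cp1251:
    # the UTF-8 bytes are decoded with the wrong code page.
    with open(path, encoding="cp1251") as file:
        wrong = file.read()

    # Naming the real encoding decodes the same bytes correctly.
    with open(path, encoding="utf-8") as file:
        right = file.read()

print("Главная страница" in right)  # True
print("Главная страница" in wrong)  # False
```

If you would rather not hard-code the encoding, another option is to open the file in binary mode ("rb") and pass the bytes to BeautifulSoup, which then tries to detect the encoding itself.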

Xiddoc

yeah thanks, that helped, feeling kind of stupid now, because of how simple the answer was – zetparson Jul 19 '22 at 07:49
_Learning is a lifelong process_ ~ keep trying and you will keep getting better :) – Xiddoc Jul 19 '22 at 07:52
