How to keep an HTML entity such as `&#9746;` intact in Beautiful Soup?

Question

My goal is to read an html document using beautiful soup, add ids to some tags and write the html back to file.

The html document has html entities such as &#9746; representing ☒. When I create a Beautiful Soup object, the entity gets converted to ☒. When I write the soup back to html using str(soup), the html file contains ☒ instead of &#9746;. Opening this in a browser yields â˜’ when I want ☒.

I tried using str(soup.encode(formatter='html')), which did encode the output as UTF-8, but the browser then shows the literal text \xe2\x98\x92.

I'm guessing there is something simple that I'm missing. Any thoughts on how to keep the special characters in the original document intact after processing it in Beautiful Soup?
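Both symptoms can be reproduced without Beautiful Soup. The sketch below is my own illustration, not part of the original post: it assumes the browser fell back to Windows-1252 for the first symptom, and it uses a plain string in place of the soup output.

```python
# "\u2612" is the character ☒ that Beautiful Soup produces from &#9746;
text = "\u2612"

# encode() returns bytes; soup.encode() returns bytes in the same way
encoded = text.encode("utf-8")

# Symptom 1: UTF-8 bytes read as Windows-1252 (an assumed browser fallback)
mojibake = encoded.decode("cp1252")

# Symptom 2: str() on bytes gives the repr of the bytes, not decoded text
bytes_repr = str(encoded)

# Decoding with the right codec recovers the original character
round_trip = encoded.decode("utf-8")

print(mojibake)
print(bytes_repr)
print(round_trip)
```

So the \xe2\x98\x92 text is what you get when the bytes object is turned into a string with str() instead of being written in binary mode or decoded.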

Is there a `<meta>` tag specifying the used encoding? If not, maybe adding one would help. — Andrej Kesely, Jan 18 '23 at 20:59
I need to process a large number of files. Some do, but most of them don't. — kyc12, Jan 18 '23 at 21:25

score 0 · Answer 1 · answered Jan 18 '23 at 21:37

I figured it out. It is quite simple actually as stated in this answer.

You just have to pass encoding='utf-8' when opening the output file. Answer from the link:

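from bs4 import BeautifulSoup

# Name the parser explicitly to avoid the "no parser specified" warning
a = BeautifulSoup('<p class="t5">&#9746; &#x20b9; 10,000 or $ 133.46</p>', 'html.parser')

filename = 'output.html'  # placeholder output path

# Write the document as UTF-8 so that ☒ and ₹ are stored correctly
with open(filename, 'w', encoding='utf-8') as outfile:
    outfile.write(str(a))  # OR outfile.write(a.prettify())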
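If the files really need to contain numeric entities again, because most of them have no charset declaration, one alternative is to write ASCII and let Python escape everything else. This is only a sketch of that idea and not what the linked answer does. The string below stands in for str(soup) and the output path is a made-up temporary file.

```python
import os
import tempfile

# Stand-in for str(soup): the entities have already been decoded to characters
html = '<p class="t5">\u2612 \u20b9 10,000 or $ 133.46</p>'

# Hypothetical output path, used only for this demonstration
out_path = os.path.join(tempfile.mkdtemp(), "out.html")

# errors="xmlcharrefreplace" writes each non-ASCII character as &#NNNN;
with open(out_path, "w", encoding="ascii", errors="xmlcharrefreplace") as f:
    f.write(html)

# Read the raw bytes back to see what ended up in the file
with open(out_path, "rb") as f:
    raw = f.read()

print(raw)
```

The entities come back in decimal form, so &#x20b9; from the input is written as &#8377;, which a browser renders identically. A pure-ASCII file also displays the same regardless of which encoding the browser guesses.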
