
My goal is to read an HTML document with Beautiful Soup, add ids to some tags, and write the HTML back to a file.

The HTML document has entities such as &#9746; representing ☒. When I create a Beautiful Soup object, the entity gets converted to the character ☒. When I write the soup back to HTML using str(soup), the file contains the literal ☒ instead of &#9746;. Opening this in a browser shows garbled characters where I want ☒.
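For reference, here is a small stdlib-only sketch of what seems to be going on. It assumes the browser falls back to Windows-1252 because the page declares no charset; that fallback is a guess about the setup, not something confirmed above.

```python
import html

# The numeric character reference decodes to one Unicode character (U+2612).
decoded = html.unescape("&#9746;")

# Written out as UTF-8, that single character occupies three bytes.
utf8_bytes = decoded.encode("utf-8")

# Assumption: the browser reads those bytes as Windows-1252, which turns
# them into three unrelated characters instead of the intended one.
misread = utf8_bytes.decode("cp1252")

print(utf8_bytes)
print(len(decoded), len(misread))
```

If that assumption holds, declaring the encoding in the document (for example with a meta charset tag) should be enough for the browser to render the character correctly.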

I tried str(soup.encode(formatter='html')), which did encode the output as UTF-8, but the page in the browser then shows the literal text \xe2\x98\x92.
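The literal \xe2\x98\x92 most likely comes from calling str() on a bytes object, which returns its repr rather than decoding it. A minimal sketch using plain bytes as a stand-in for what soup.encode() returns:

```python
text = "\u2612"  # the ☒ character

# encode() produces bytes, just as soup.encode() does.
encoded = text.encode("utf-8")

# str() on bytes gives the printable repr, escape sequences included.
as_repr = str(encoded)

# decode() is what turns the bytes back into the original text.
round_trip = encoded.decode("utf-8")

print(as_repr)
```

So writing str(soup.encode(...)) to a file stores the escape sequences as ordinary text, which is what the browser then displays.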

I'm guessing there is something simple that I'm missing. Any thoughts on how to keep the special characters in the original document intact after processing it in Beautiful Soup?

kyc12

1 Answer


I figured it out. It is actually quite simple, as stated in this answer.

You just have to pass encoding='utf-8' when opening the output file. The answer from the link:

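from bs4 import BeautifulSoup

filename = 'output.html'  # placeholder path; use your own

# Name the parser explicitly to avoid bs4's "no parser specified" warning
a = BeautifulSoup('<p class="t5">&#9746; &#x20b9; 10,000 or $ 133.46</p>', 'html.parser')

# Write as UTF-8 so non-ASCII characters such as ☒ are stored correctly
with open(filename, 'w', encoding='utf-8') as outfile:
    outfile.write(str(a))  # OR outfile.write(a.prettify())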
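Note that this stores the characters themselves as UTF-8 rather than restoring the original entities. If the file really must contain numeric references such as &#9746;, one option is Python's built-in xmlcharrefreplace error handler. The sketch below is stdlib-only; to_numeric_refs is a hypothetical helper name, and the markup string stands in for what str(soup) would return.

```python
def to_numeric_refs(markup: str) -> str:
    """Replace every non-ASCII character with a decimal character reference."""
    # Characters that ASCII cannot represent are emitted as &#NNNN; instead.
    return markup.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


# Stand-in for str(soup); this literal is illustrative only.
markup = '<p class="t5">\u2612 \u20b9 10,000 or $ 133.46</p>'
restored = to_numeric_refs(markup)

print(restored)
```

The references come out in decimal form, so a character originally written as &#x20b9; would be saved as &#8377;. Both refer to the same character.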
kyc12