I'm trying to parse an HTML page that I saved to my computer (Windows 10):

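from bs4 import BeautifulSoup

# Read the saved page as UTF-8 so the Japanese text is decoded correctly
with open("res/JLPT N5 vocab list.html", "r", encoding="utf8") as f:
    soup = BeautifulSoup(f, "html.parser")

tables = soup.find_all("table")
sectable = tables[1]  # second table on the page
for tr in sectable.contents[1:]:
    if tr.name == "tr":
        try:
            # The link text in the first cell is the Japanese word
            print(tr.td.a.get_text())
        except AttributeError:
            # Skip rows that have no <td> or no <a> inside it
            continue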

It should print all of the Japanese words in the first column, but an error is raised at print(tr.td.a.get_text()): UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-1: character maps to <undefined>. How can I solve this error?

witoong623

1 Answer

I finally solved it with the help of the Miscellaneous section of the Beautiful Soup documentation:

UnicodeEncodeError: 'charmap' codec can't encode character u'\xfoo' in position bar (or just about any other UnicodeEncodeError) - This is not a problem with Beautiful Soup. This problem shows up in two main situations. First, when you try to print a Unicode character that your console doesn’t know how to display. (See this page on the Python wiki for help.) Second, when you’re writing to a file and you pass in a Unicode character that’s not supported by your default encoding. In this case, the simplest solution is to explicitly encode the Unicode string into UTF-8 with u.encode("utf8").
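To illustrate what the quoted passage describes, here is a minimal sketch. It assumes the console or default encoding is cp1252 (a common "charmap" code page on Western Windows installs; yours may differ), and the sample word and the helper name `can_encode` are made up for the demonstration.

```python
# Sample Japanese text; any characters outside cp1252 behave the same way.
word = "日本語"

def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec (hypothetical helper)."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        # This is the same exception class that print() raises when the
        # console encoding cannot represent the characters.
        return False

cp1252_ok = can_encode(word, "cp1252")  # assumed legacy console code page
utf8_ok = can_encode(word, "utf8")      # UTF-8 can represent any Unicode text

# Explicit encoding, as the documentation suggests, yields bytes.
utf8_bytes = word.encode("utf8")
```

The string itself is fine; the failure only happens at the moment it is converted to an encoding that has no mapping for those characters.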

In my case, it was because I tried to print Unicode characters that my console doesn't know how to display.
So I enabled a TrueType font for the console, changed the system locale to Japanese (which changes the console encoding and makes fonts that support Japanese available to the console), and then changed the console font to MS ゴシック (this font appeared only after I changed the system locale).
If I want to write the text to a file instead, I just open the file and specify UTF-8 as the encoding.
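Writing to a file with an explicit encoding could look roughly like the sketch below. The file name `n5_words.txt`, the sample words, and the helper names `save_words` and `load_words` are assumptions for illustration, not part of my actual script.

```python
import os
import tempfile

def save_words(words, path):
    """Write one word per line, forcing UTF-8 instead of the locale default."""
    with open(path, "w", encoding="utf-8") as out:
        for w in words:
            out.write(w + "\n")

def load_words(path):
    """Read the words back using the same explicit encoding."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

# Hypothetical sample data standing in for the scraped first column.
words = ["学校", "先生"]
out_path = os.path.join(tempfile.gettempdir(), "n5_words.txt")
save_words(words, out_path)
```

Passing `encoding="utf-8"` to `open()` matters on Windows, because without it Python falls back to the locale's default encoding, which may be the same charmap code page that caused the error in the first place.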

witoong623