I am attempting to read in a large set of .htm files with Python. To do so I am using the following:

import codecs
from bs4 import BeautifulSoup
HtmlFile = codecs.open(file, 'r')
text = BeautifulSoup(HtmlFile.read()).text

However, this results in the following error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 411: 
character maps to <undefined>

So, I tried encoding with utf-8 like so:

HtmlFile = codecs.open(file, 'r', encoding='utf-8')
text = BeautifulSoup(HtmlFile.read()).text

And then I got this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 4565: 
invalid start byte

I tried following the advice here, but it did not help. Any help would be greatly appreciated!

Stephen Strosko

1 Answer

I've done a bit of research, and this looks like an issue with a Microsoft-generated file that uses the CP1252 (windows-1252) encoding, some bytes of which are not picked up correctly by the default codec. If your HTML file contains something like the following:

<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=Generator content="Microsoft Word 15 (filtered)">

then this is more than likely the cause.

According to this answer, using the Latin-1 encoding in that case could help:

HtmlFile = codecs.open(file, 'r', encoding='latin-1')
text = BeautifulSoup(HtmlFile.read()).text

Let me know if this works. Beware, though, that Latin-1 does not have all of the printable characters that the Microsoft encodings do: the bytes in the 0x80-0x9F range (curly quotes, en dashes, and so on) come through as control characters instead of the intended symbols.
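If those Windows-specific characters matter for your text extraction, a small variation worth trying (an assumption based on the charset declared in the meta tag above, not something verified against your files) is to decode with cp1252 directly and tell the codec to replace any stray bytes, since the 0x81 in your first traceback is undefined even in cp1252:

import codecs
from bs4 import BeautifulSoup

# cp1252 keeps the Microsoft punctuation (curly quotes, dashes) readable,
# and errors='replace' turns undefined bytes such as 0x81 into U+FFFD
# instead of raising UnicodeDecodeError.
HtmlFile = codecs.open(file, 'r', encoding='cp1252', errors='replace')
text = BeautifulSoup(HtmlFile.read()).text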

AER
  • Thank you so much! This seems to solve the issue for me. I absolutely hate encoding errors. Not sure why they encoded their files like so. – Stephen Strosko Dec 18 '18 at 00:23
  • Yeah, they're tedious as. This may work in this case but if you're ingesting a heap of websites, you may need to check the encoding and process accordingly. Good luck! – AER Dec 18 '18 at 00:24
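
To expand on that last comment: if you are ingesting many .htm files from different sources, one option is to let BeautifulSoup's bundled UnicodeDammit detector sniff each file's encoding (it checks the meta charset and falls back to guessing) instead of hard-coding a single codec. A minimal sketch, with the directory name htm_files and the helper read_htm_text purely illustrative:

from pathlib import Path
from bs4 import BeautifulSoup, UnicodeDammit

def read_htm_text(path):
    # Read raw bytes and let UnicodeDammit work out the charset; its
    # original_encoding attribute reports what it detected.
    raw = Path(path).read_bytes()
    dammit = UnicodeDammit(raw)
    return BeautifulSoup(dammit.unicode_markup, 'html.parser').text

# Hypothetical batch run over a folder of .htm files.
texts = {p.name: read_htm_text(p) for p in Path('htm_files').glob('*.htm')}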