I am attempting to read in a large set of .htm files with Python. To do so I am using the following:

import codecs
from bs4 import BeautifulSoup
HtmlFile = codecs.open(file, 'r')
text = BeautifulSoup(HtmlFile.read()).text

However, this results in the following error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 411: 
character maps to <undefined>

So, I tried encoding with utf-8 like so:

HtmlFile = codecs.open(file, 'r', encoding='utf-8')
text = BeautifulSoup(HtmlFile.read()).text

And then I got this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 4565: 
invalid start byte

I tried following the advice here, but it did not help. Any help would be greatly appreciated!

Stephen Strosko

1 Answer

I've done a bit of research, and this looks like an issue with a Microsoft-generated file that uses the CP1252 (windows-1252) encoding, some bytes of which are not picked up correctly by the default codec. If your HTML file contains something like the following:

<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=Generator content="Microsoft Word 15 (filtered)">

then this is more than likely the cause.

According to this answer, using the Latin-1 encoding in that case could help:

HtmlFile = codecs.open(file, 'r', encoding='latin-1')
text = BeautifulSoup(HtmlFile.read()).text

Let me know if this works. Beware, though, that Latin-1 does not have all of the printable characters that the Microsoft encodings do: the bytes in the 0x80-0x9F range (curly quotes, en dashes, and so on) come through as control characters instead of the intended symbols.
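If those Windows-specific characters matter for your text extraction, a small variation worth trying (an assumption based on the charset declared in the meta tag above, not something verified against your files) is to decode with cp1252 directly and tell the codec to replace any stray bytes, since the 0x81 in your first traceback is undefined even in cp1252:

import codecs
from bs4 import BeautifulSoup

# cp1252 keeps the Microsoft punctuation (curly quotes, dashes) readable,
# and errors='replace' turns undefined bytes such as 0x81 into U+FFFD
# instead of raising UnicodeDecodeError.
HtmlFile = codecs.open(file, 'r', encoding='cp1252', errors='replace')
text = BeautifulSoup(HtmlFile.read()).text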

AER
  • Thank you so much! This seems to solve the issue for me. I absolutely hate encoding errors. Not sure why they encoded their files like so. – Stephen Strosko Dec 18 '18 at 00:23
  • Yeah, they're tedious as. This may work in this case but if you're ingesting a heap of websites, you may need to check the encoding and process accordingly. Good luck! – AER Dec 18 '18 at 00:24
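
To expand on that last comment: if you are ingesting many .htm files from different sources, one option is to let BeautifulSoup's bundled UnicodeDammit detector sniff each file's encoding (it checks the meta charset and falls back to guessing) instead of hard-coding a single codec. A minimal sketch, with the directory name htm_files and the helper read_htm_text purely illustrative:

from pathlib import Path
from bs4 import BeautifulSoup, UnicodeDammit

def read_htm_text(path):
    # Read raw bytes and let UnicodeDammit work out the charset; its
    # original_encoding attribute reports what it detected.
    raw = Path(path).read_bytes()
    dammit = UnicodeDammit(raw)
    return BeautifulSoup(dammit.unicode_markup, 'html.parser').text

# Hypothetical batch run over a folder of .htm files.
texts = {p.name: read_htm_text(p) for p in Path('htm_files').glob('*.htm')}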