I am attempting to read in a large set of .htm files with Python. To do so, I am using the following:
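import codecs
from bs4 import BeautifulSoup  # assuming BeautifulSoup 4
HtmlFile = codecs.open(file, 'r')  # no encoding given, so the platform default is used
text = BeautifulSoup(HtmlFile.read()).text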
However, this results in the following error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 411: character maps to <undefined>
So I tried specifying utf-8 as the encoding, like so:
HtmlFile = codecs.open(file, 'r', encoding='utf-8')
text = BeautifulSoup(HtmlFile.read()).text
And then I got this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 4565: invalid start byte
I tried following the advice here, but it did not help. Any help would be greatly appreciated!
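For reference, the two errors above suggest the files are not all in one encoding: 0xa0 is not a valid UTF-8 start byte, and 0x81 has no mapping in the Windows 'charmap' codec. A minimal sketch of one thing that might be worth trying is to read the raw bytes and try several candidate encodings in turn. The helper name `read_html_text` and the candidate list are assumptions made for illustration, not something known about these particular files.

```python
from pathlib import Path


# Hypothetical helper: the candidate encodings are a guess, not a known
# property of the .htm files in question.
def read_html_text(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding_used), trying each candidate encoding in order."""
    raw = Path(path).read_bytes()  # read bytes so no implicit decoding happens
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # this codec rejected a byte; try the next candidate
    # Last resort: decode as UTF-8 and replace undecodable bytes with U+FFFD
    return raw.decode("utf-8", errors="replace"), "utf-8 (with replacement)"
```

The returned text could then be handed to BeautifulSoup as before. Note that latin-1 accepts every byte value, so a successful decode does not prove the guess was right. BeautifulSoup also accepts raw bytes and attempts its own encoding detection, which may be a simpler route.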