I'm using BeautifulSoup to parse a bunch of web pages that I downloaded locally with wget.
I'm reading in the file like this:
file = open(file_name, 'r', encoding='utf-8').read()
soup = BeautifulSoup(file, 'html5lib')
I'm using this soup object to get the text, which I then write to a .json file like this:
f.write('"text": "' + str(text.encode('utf-8')) )
However, when I open the .json file I see strings like this:
and\xe2\x80\x94in spite of
He hadn\xe2\x80\x99t shaved in a few days at least
and Michael can go.\xe2\x80\x9d\xc2\xa0 Her voice
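A minimal sketch that seems to reproduce these strings. The sample string is made up for illustration; it only assumes the extracted text contains a curly apostrophe:

```python
# Made-up sample standing in for the text extracted from the soup object.
text = "He hadn\u2019t shaved"

# Encoding gives a bytes object; calling str() on bytes returns its repr,
# so the non-ASCII bytes show up as literal \x.. escape sequences.
as_written = str(text.encode("utf-8"))
print(as_written)

# Decoding the same bytes as UTF-8 gives the original text back.
round_trip = text.encode("utf-8").decode("utf-8")
```

The escapes are the UTF-8 bytes of the original characters, shown as text because str() was called on a bytes object.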
I assume these escape sequences stand for non-ASCII characters (dashes, curly quotes, non-breaking spaces) that Python doesn't know how to write out, but I don't know how to fix this.
Thanks for any help.
EDIT: I'm using Python 3.
Also, if I remove the part where I encode the text before I write it, I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 264: ordinal not in range(128)
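One possible approach, sketched under assumptions: the error suggests the output file was opened without an explicit encoding, so the locale default (ASCII here) was used. Opening it with encoding='utf-8' and letting the json module build the output may avoid both problems. The sample text and output path below are hypothetical:

```python
import json
import os
import tempfile

# Hypothetical sample standing in for the text extracted from the soup object.
text = "and\u2014in spite of \u201cquotes\u201d\u00a0here"

# Hypothetical output path, used only for this sketch.
out_path = os.path.join(tempfile.mkdtemp(), "out.json")

# Pass the encoding explicitly so the locale default is not used.
with open(out_path, "w", encoding="utf-8") as f:
    # json.dump takes care of quoting and escaping; ensure_ascii=False
    # writes the characters themselves instead of \uXXXX escapes.
    json.dump({"text": text}, f, ensure_ascii=False)

# Read the file back to check that the text survived unchanged.
with open(out_path, encoding="utf-8") as f:
    loaded = json.load(f)
```

With this, the text is written as a str and never passed through str(bytes), so no \x escapes should appear in the file.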