I'm trying to parse the contents of a website with a Python 3 script, and I'm running into a UnicodeEncodeError:
import urllib.request
myurl = "https://stackoverflow.com/"
with urllib.request.urlopen(myurl) as url:
    html = url.read()
print(type(html))
content = html.decode("UTF-8", "ignore")
print(type(content))
print(content)
This produces:
<class 'bytes'>
<class 'str'>
  File "C:\Python3\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u200b' in position 688: character maps to <undefined>
The decoding itself doesn't seem to be what fails, since the second print call (the one showing the type of the decoded string) still goes through. But the decoded string apparently still contains Unicode characters that I expected "ignore" to drop?
Have I read the docs on this wrong?
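To narrow it down, here is a minimal sketch that reproduces the situation without the network request. It assumes the console encoding is cp1252, which the traceback suggests; the sample bytes and the variable names are made up for illustration.

```python
# Assumption: the console encodes output as cp1252, as the traceback suggests.
# Hypothetical sample: UTF-8 bytes containing U+200B (zero width space).
raw = b"zero\xe2\x80\x8bwidth"

# Decoding succeeds and keeps U+200B, because it is valid UTF-8.
# "ignore" here only drops bytes that cannot be decoded.
text = raw.decode("UTF-8", "ignore")
has_zwsp = "\u200b" in text

# Encoding the string as cp1252 is the step that raises the error.
try:
    text.encode("cp1252")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# An error handler only drops the character when applied on the encoding side.
safe = text.encode("cp1252", "ignore").decode("cp1252")
print(safe)
```

If this matches what happens on my machine, the error would come from print encoding the string for the console, not from the decode call.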