UnicodeEncodeError from parsed website (Python3)

Question

I'm trying to parse something from the contents of a website using a Python3 script, and I'm running into a 'UnicodeEncodeError':

import urllib.request

myurl = "https://stackoverflow.com/"
with urllib.request.urlopen(myurl) as url:
    html = url.read()
    print(type(html))
    content = html.decode("UTF-8", "ignore")
    print(type(content))
    print(content)

This produces:

<class 'bytes'>
<class 'str'>
  File "C:\Python3\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u200b' in position 688: character maps to <undefined>

Now, it's not the decoding itself that fails (the second print call still went through), but the decoded string somehow still contains Unicode characters that I expected to be ignored?

Have I read the docs on this wrong?

"Have I read the docs on this wrong?" <- Possibly. :-) It sounds as though you were expecting the `decode("utf-8", "ignore")` to remove some Unicode characters, but that's not what it does: it simply ignores portions of the input that can't be decoded as valid Unicode. Anything that _is_ valid Unicode will be kept, including things like zero width spaces. The problem here is coming from printing a zero-width space to the console when the console's encoding (cp1252) doesn't support zero-width spaces. — Mark Dickinson, Sep 07 '18 at 08:51
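The behaviour described in this comment can be checked without a network request or a console. The following is only a minimal sketch: the byte string `raw` is made up for illustration (it is not taken from the page in the question), and `failed_char` is a hypothetical name.

```python
# Made-up sample bytes: 'a', ZERO WIDTH SPACE encoded as UTF-8 (e2 80 8b),
# 'b', one byte that is invalid in UTF-8 (ff), 'c'.
raw = b"a\xe2\x80\x8bb\xffc"

# "ignore" drops only the undecodable byte 0xff.
# The zero-width space is valid UTF-8, so it stays in the string.
text = raw.decode("utf-8", "ignore")
print(ascii(text))

# Encoding to cp1252 is roughly what print() has to do on a cp1252 console,
# and cp1252 has no mapping for U+200B, so the encode step is what fails.
failed_char = None
try:
    text.encode("cp1252")
except UnicodeEncodeError as exc:
    failed_char = exc.object[exc.start:exc.end]
    print("cannot encode:", ascii(failed_char))
```

So the error comes from the encode step on output, not from the decode step on input.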
@MarkDickinson and how do I fix this? (preferably as an answer, so I can accept it, if it works...) — CodingCat, Sep 07 '18 at 10:03
It depends what you want to do: you have a string containing characters that your console encoding doesn't support. Do you just want to print the characters that can be printed and ignore the unprintable ones? Or replace the unprintable ones with something? But I strongly suspect that this is a duplicate question, since getting a `UnicodeEncodeError` on printing is a very common problem, especially on Windows. BTW, are you using a Python version earlier than Python 3.6? If you're in a position to do so, you might try upgrading to Python 3.6. PEP 528 is relevant here. — Mark Dickinson, Sep 07 '18 at 15:12
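The two options raised in this comment (dropping the unprintable characters or replacing them) could be sketched as below. This is an illustration under assumptions rather than an accepted answer: `console_safe` is a hypothetical helper name, `sample` is made-up text, and cp1252 is hard-coded only to mirror the console from the traceback.

```python
import sys


def console_safe(text, encoding=None, errors="replace"):
    """Round-trip text through an encoding so only representable characters remain.

    Hypothetical helper. `errors` is a standard codec error handler:
    "replace" substitutes '?', "ignore" drops the character,
    "backslashreplace" writes an escape such as \\u200b.
    """
    # Fall back to UTF-8 if stdout reports no encoding (e.g. when redirected).
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    return text.encode(encoding, errors=errors).decode(encoding)


sample = "zero\u200bwidth"  # made-up text containing a zero-width space

replaced = console_safe(sample, "cp1252", "replace")
ignored = console_safe(sample, "cp1252", "ignore")
escaped = console_safe(sample, "cp1252", "backslashreplace")

# With no explicit encoding, the helper targets whatever stdout reports.
print(console_safe(sample))
```

On Python 3.7 and later, `sys.stdout.reconfigure(errors="replace")` should give a similar effect for every subsequent `print` call, and setting the `PYTHONIOENCODING` environment variable is another route. Which of these fits depends on whether the output is only for eyeballing or needs to be preserved exactly.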
