How does one ignore 4-byte utf-8 characters in Python?

Question

I am trying to experiment with web scraping in a Python program. The HTML page that I get is encoded as UTF-8. I am having trouble with the following character: '𠆢'. I believe it is due to the character taking 4 bytes (it encodes to b'\xf0\xa0\x86\xa2'). I also noted that Windows is not friendly to UTF-8, and I am a Windows user.

I have spent several hours trying to find a way to parse the text and remove the bad 4-byte character when it comes up, without success. Since the character is part of a full line of text, I would like to parse through the line and remove only the undecodable character.

def TryDecode(toParse):
    try:
        result = toParse.decode('utf-8', 'ignore') #No exception
    except UnicodeEncodeError:
        result = 'error'
    return result

badutf = b'  <li ...>\xf0\xa0\x86\xa2</li>\r\n'
res = TryDecode(badutf)
print("I see this")
print(res) # UnicodeEncodeError
print("I do not see this.")

Expected results: an error thrown inside the try block, or no error at all. Actual results: no error until the second print statement. Note: if I include the '𠆢' character in my script, it also becomes impossible to run it from the IDE.

Edit: Thanks to helpful advice, I understand the problem now. Here is a solution if anyone else runs into a similar issue:

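UCSTWOMAX = 65536 # First code point beyond the BMP; UCS-2 covers 0..65535
def TryDecode(toParse):
    try:
        # 'ignore' silently drops byte sequences that are not valid UTF-8
        parsed = toParse.decode('utf-8', 'ignore')
        # Keep only characters inside the BMP
        result = ''.join(c for c in parsed if ord(c) < UCSTWOMAX)
    except UnicodeDecodeError:
        result = 'error'
    return result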

badutf = b'  <li ...>\xf0\xa0\x86\xa2</li>\r\n'
res = TryDecode(badutf)
print(res)
print("I see this now.")

You may find e.g. /questions/6344853/python-unicode-in-windows-terminal-encoding-used helpful. — Karl Knechtel, Apr 29 '19 at 10:41

score 3 · Accepted Answer · answered Apr 29 '19 at 10:39

Your byte sequence b'\xf0\xa0\x86\xa2' decodes to '\U000201a2'. This is not a bad codepoint, but it does lie outside the Basic Multilingual Plane (BMP), which means that much software (including Tk, and applications like IDLE that use Tk) will have trouble displaying it. This is because Tk (despite claims to the contrary) does not fully support Unicode, only the range covered by UCS-2. UCS-2 is the fixed-width predecessor of UTF-16 and cannot represent characters outside the BMP.
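A quick way to check the claim above is to decode the four bytes and inspect the resulting code point. This is only a sketch; the variable names are illustrative and not from the question.

```python
raw = b'\xf0\xa0\x86\xa2'

# Four UTF-8 bytes decode to a single character.
ch = raw.decode('utf-8')
code_point = ord(ch)

# The BMP ends at U+FFFF; anything above needs a surrogate pair in UTF-16
# and cannot be represented in UCS-2 at all.
outside_bmp = code_point > 0xFFFF
utf16 = ch.encode('utf-16-be')

print(len(ch), hex(code_point), outside_bmp)  # 1 0x201a2 True
print(utf16.hex())                            # d840dda2 (high + low surrogate)
```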

Decode as UTF-8 the way you are doing:

res = TryDecode(badutf)

then delete the character your software has trouble displaying:

fixed = res.replace('\U000201a2','')
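The replace call above removes one specific character. If other characters outside the BMP may appear, a more general sketch is to strip the whole supplementary range with a regular expression. The helper name strip_non_bmp is hypothetical, chosen here for illustration.

```python
import re

# Matches any code point from U+10000 to U+10FFFF (everything outside the BMP).
_NON_BMP = re.compile('[\U00010000-\U0010FFFF]')

def strip_non_bmp(text):
    """Return text with all characters outside the BMP removed."""
    return _NON_BMP.sub('', text)

decoded = b'  <li ...>\xf0\xa0\x86\xa2</li>\r\n'.decode('utf-8')
cleaned = strip_non_bmp(decoded)
print(repr(cleaned))  # '  <li ...></li>\r\n'
```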

As a side note, Windows is not unfriendly to Unicode. Its NTFS filesystem was an early adopter of Unicode file names (in the 1990s).

score 3 · Answer 2 · answered Apr 29 '19 at 16:56

If you are getting a UnicodeEncodeError on print, you must not be using Python 3.6+ on Windows. That version and later use the Unicode console APIs. You may see a substitution character if the console font does not support the character, but the printed characters, when cut and pasted, will display correctly in applications that support them.

Example:

What I see in the Windows terminal:

That same text copied to StackOverflow (Notepad/Notepad++ work, too):

Python 3.6.8 (tags/v3.6.8:3c6b436a57, Dec 24 2018, 00:16:47) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> s = '\U000201a2'
>>> print(s)
𠆢

If you just need to filter characters outside the BMP, you can use this after decoding the string:

>>> s = "text\U000201a2more text"
>>> s = ''.join(x for x in s if ord(x) < 65536)
>>> s
'textmore text'
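If the UnicodeEncodeError on print does occur (for example on an older Python, or when output does not go to a Unicode-capable console), an alternative to dropping characters is to escape whatever the output encoding cannot represent. The sketch below uses the standard 'backslashreplace' error handler; the helper name to_printable is hypothetical, and cp850 is used only because it is the code page mentioned elsewhere in this thread.

```python
import sys

def to_printable(text, encoding):
    """Escape characters that `encoding` cannot represent, keep the rest."""
    return text.encode(encoding, errors='backslashreplace').decode(encoding)

s = "text\U000201a2more text"

# Use the real output encoding when it is known, else fall back to ASCII.
encoding = getattr(sys.stdout, 'encoding', None) or 'ascii'
print(to_printable(s, encoding))

# With a legacy code page the character is escaped instead of raising.
print(to_printable(s, 'cp850'))  # text\U000201a2more text
```

Unlike filtering, this keeps a visible trace of the character, which can help when debugging scraped text.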

score 0 · Answer 3 · answered Apr 29 '19 at 10:47

I think this post can solve your problem: stackoverflow question 31805474 - encode error

As you pointed out, the problem is related to the Windows terminal (if you run your code in Jupyter, it will correctly print '𠆢' with no errors). Your try clause is working correctly, since the decode handles the string without a problem; the traceback is generated by print() itself (...\lib\encodings\cp850.py), which cannot encode the character for the console.
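The split between decoding (which succeeds) and encoding for the console (which fails) can be reproduced without a Windows terminal. This is a minimal sketch; can_encode is a hypothetical helper, and cp850 is assumed as the console code page because it appears in the traceback path above.

```python
raw = b'  <li ...>\xf0\xa0\x86\xa2</li>\r\n'

# Decoding is fine: the bytes are valid UTF-8.
text = raw.decode('utf-8')

def can_encode(s, encoding):
    """Return True if every character of s is representable in encoding."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

print(can_encode(text, 'utf-8'))  # True
print(can_encode(text, 'cp850'))  # False: this is what print() runs into
```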

The answer in the link will avoid the traceback, but the character will be rendered as a sequence of other characters (<li ...>ð †¢</li>).
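The sequence of other characters quoted above is consistent with the four UTF-8 bytes being interpreted one byte at a time in a single-byte code page. The sketch below assumes cp1252; the code page actually in effect on the answerer's machine is not stated.

```python
raw = b'\xf0\xa0\x86\xa2'

# Each byte becomes its own character under a single-byte code page.
mojibake = raw.decode('cp1252')
print(ascii(mojibake))  # '\xf0\xa0\u2020\xa2'

# The damage is reversible as long as the bytes were not altered.
restored = mojibake.encode('cp1252').decode('utf-8')
print(hex(ord(restored)))  # 0x201a2
```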

I saw that solution over here as well: https://stackoverflow.com/questions/388490/how-to-use-unicode-characters-in-windows-command-line It seems like there are some risks that come along with it, so I wanted to try to find a different way around it. — Spiros, Apr 29 '19 at 13:09
