I am experimenting with web scraping in a Python program. The HTML page that I get is encoded in UTF-8. I am having trouble with the following character: '𠆢' (U+201A2). I believe it is because the character takes 4 bytes (it encodes to b'\xf0\xa0\x86\xa2'). I have also noticed that Windows is not friendly to UTF-8, and I am a Windows user.
I have spent several hours trying to find a way to parse the text and remove the bad 4-byte character when it comes up, without success. Since the character is part of a full line of text, I would like to go through the line and remove only the offending character.
def TryDecode(toParse):
    try:
        result = toParse.decode('utf-8', 'ignore')  # No exception
    except UnicodeEncodeError:
        result = 'error'
    return result
badutf = b' <li ...>\xf0\xa0\x86\xa2</li>\r\n'
res = TryDecode(badutf)
print("I see this")
print(res) # UnicodeEncodeError
print("I do not see this.")
Expected results: an error is raised inside the try block, or not at all. Actual results: no error until the second print statement. Note: if I include the U+201A2 character literally in my script, it also becomes impossible to run the script from the IDE.
Edit: Thanks to helpful advice, I understand the problem now. Here is a solution if anyone else runs into a similar issue:
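UCSTWOMAX = 65536  # Number of code points representable in UCS-2 (U+0000 to U+FFFF)
def TryDecode(toParse):
    try:
        parsed = toParse.decode('utf-8', 'ignore')
        result = ''
        for c in parsed:
            # Keep only characters inside the Basic Multilingual Plane
            if ord(c) < UCSTWOMAX:
                result += c
    except UnicodeDecodeError:  # decode() raises UnicodeDecodeError, not UnicodeEncodeError
        result = 'error'
    return result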
badutf = b' <li ...>\xf0\xa0\x86\xa2</li>\r\n'
res = TryDecode(badutf)
print(res)
print("I see this now.")
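For anyone comparing approaches: the bytes decode without any problem, and the failure only appears when the resulting string is written to an output that cannot represent the character. Below is a minimal sketch of two ways to deal with that. The helper names `strip_non_bmp` and `printable_for` are hypothetical, and `cp1252` is only an assumed stand-in for a legacy Windows console encoding; the encoding actually in use on a given machine may differ.

```python
BMP_MAX = 0xFFFF  # highest code point in the Basic Multilingual Plane


def strip_non_bmp(text):
    """Return text with every character above U+FFFF removed (hypothetical helper)."""
    return ''.join(c for c in text if ord(c) <= BMP_MAX)


def printable_for(text, encoding):
    """Round-trip text through `encoding`, replacing unencodable characters with '?'."""
    return text.encode(encoding, errors='replace').decode(encoding)


raw = b' <li ...>\xf0\xa0\x86\xa2</li>\r\n'
decoded = raw.decode('utf-8')  # decoding succeeds; the result contains U+201A2

# repr() is used so the output stays ASCII-only on any console.
print(repr(strip_non_bmp(decoded)))
# 'cp1252' is an assumed example of a legacy console encoding.
print(repr(printable_for(decoded, 'cp1252')))
```

The first helper does the same filtering as the loop above. The second keeps a visible placeholder instead of silently dropping the character, and it can be given whatever encoding the output stream reports (for example `sys.stdout.encoding`, when that is set).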