I have some data in a database that a user entered as "BTS⚾️>BTS🎤", i.e. "BTS" + the baseball emoji + ">BTS" + the microphone emoji. When I read it from the database, decode it, and print it in Python 2, it displays the emojis correctly. But when I try to decode the same bytes in Python 3, it fails with a UnicodeDecodeError.
The bytes in Python 2:
>>> data
'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'
Decoding these as UTF-8 outputs this unicode string:
>>> 'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'.decode('utf_8')
u'BTS\u26be\ufe0f>BTS\U0001f3a4'
Printing that unicode string on my Mac displays the baseball and microphone emojis:
>>> print u'BTS\u26be\ufe0f>BTS\U0001f3a4'
BTS⚾️>BTS🎤
However, in Python 3, decoding the same bytes as UTF-8 gives me an error:
>>> b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'.decode('utf_8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 13: invalid continuation byte
In particular, it seems to take issue with the last 6 bytes (the microphone emoji):
>>> b'\xed\xa0\xbc\xed\xbe\xa4'.decode('utf_8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte
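As a diagnostic sketch, the six bytes can be inspected in Python 3 by decoding them with the `surrogatepass` error handler, which makes the UTF-8 codec accept encoded surrogate code points instead of rejecting them. The variable names (`raw`, `halves`, `merged`) are only illustrative, and the UTF-16 round trip is one possible way to combine the two halves, not necessarily what the original encoder did:

```python
# The six bytes that Python 3 refuses to decode as strict UTF-8.
raw = b'\xed\xa0\xbc\xed\xbe\xa4'

# 'surrogatepass' lets the UTF-8 codec decode each 3-byte group
# even though the resulting code points are UTF-16 surrogates.
halves = raw.decode('utf-8', errors='surrogatepass')
print([hex(ord(ch)) for ch in halves])  # ['0xd83c', '0xdfa4']

# Round-tripping through UTF-16 merges the high/low surrogate pair
# into the single code point it stands for.
merged = halves.encode('utf-16', 'surrogatepass').decode('utf-16')
print(hex(ord(merged)))  # 0x1f3a4
```

If this reading is right, the bytes hold the two UTF-16 surrogate halves (U+D83C, U+DFA4) of U+1F3A4, each encoded separately as a 3-byte sequence, rather than the single 4-byte UTF-8 sequence for the microphone emoji.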
Furthermore, other tools, like this online hex to Unicode converter, tell me these bytes are not a valid Unicode character:
https://onlineutf8tools.com/convert-bytes-to-utf8?input=ed%20a0%20bc%20ed%20be%20a4
Why do Python 2 and whatever program encoded the user's input think these bytes are the microphone emoji, but Python 3 and other tools do not?