Why does Python 2 think these bytes are the microphone emoji but Python 3 doesn't?

Question

I have some data in a database which was inputted by a user as "BTS⚾️>BTS🎤", i.e. "BTS" + the baseball emoji + ">BTS" + the microphone emoji. When I read it from the database, decode it, and print it in Python 2, it displays the emojis correctly. But when I try to decode the same bytes in Python 3, it fails with a UnicodeDecodeError.

The bytes in Python 2:

>>> data
'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'

Decoding these as UTF-8 outputs this unicode string:

>>> 'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'.decode('utf_8')
u'BTS\u26be\ufe0f>BTS\U0001f3a4'

Printing that unicode string on my Mac displays the baseball and microphone emojis:

>>> print u'BTS\u26be\ufe0f>BTS\U0001f3a4'
BTS⚾️>BTS🎤

However in Python 3, decoding the same bytes as UTF-8 gives me an error:

>>> b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'.decode('utf_8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 13: invalid continuation byte

In particular, it seems to take issue with the last 6 bytes (the microphone emoji):

>>> b'\xed\xa0\xbc\xed\xbe\xa4'.decode('utf_8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte

Furthermore, other tools, like this online hex to Unicode converter, tell me these bytes are not a valid Unicode character:

https://onlineutf8tools.com/convert-bytes-to-utf8?input=ed%20a0%20bc%20ed%20be%20a4

Why do Python 2 and whatever program encoded the user's input think these bytes are the microphone emoji, but Python 3 and other tools do not?

score 5 · Accepted Answer · edited Aug 16 '19 at 20:13

A couple of web pages help answer your question:

https://bugs.python.org/issue9133 (Relates to Python 2's overly permissive UTF-8 handling)
How to work with surrogate pairs in Python? (Relates to dealing with that permissiveness)

If I decode the bytes you got from Python 2 using Python 3's "surrogatepass" error handler, that is:

b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'.decode('utf_8', errors='surrogatepass')

then I get the string 'BTS⚾️>BTS\ud83c\udfa4', where '\ud83c\udfa4' is a surrogate pair that's supposed to stand in for the microphone emoji.
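
To see where those six bytes come from, here is a small sketch of the arithmetic. It assumes the data was produced by first splitting the microphone's code point (U+1F3A4) into a UTF-16 surrogate pair and then encoding each surrogate as its own 3-byte sequence, a scheme often called CESU-8. The variable names are mine, purely for illustration.

```python
code_point = 0x1F3A4  # the microphone emoji

# Split the code point into a UTF-16 surrogate pair.
offset = code_point - 0x10000
high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
print(hex(high), hex(low))

# Encoding each surrogate as if it were an ordinary character gives 3 + 3 bytes.
# Python 3 only allows this with the 'surrogatepass' error handler.
six_bytes = (chr(high) + chr(low)).encode('utf_8', errors='surrogatepass')
print(six_bytes)

# Standard UTF-8 encodes the code point directly as a single 4-byte sequence.
four_bytes = chr(code_point).encode('utf_8')
print(four_bytes)
```

The six-byte form matches the tail of the data in the question, while the four-byte form is what a strict UTF-8 decoder expects.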

You can get back to the microphone in Python 3 by encoding the string containing the surrogate pair as UTF-16 with the "surrogatepass" error handler and then decoding the result as UTF-16:

>>> string_as_utf_8 = b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'.decode('utf_8', errors='surrogatepass')
>>> bytes_as_utf_16 = string_as_utf_8.encode('utf_16', errors='surrogatepass')
>>> string_as_utf_16 = bytes_as_utf_16.decode('utf_16')
>>> print(string_as_utf_16)
BTS⚾️>BTS🎤
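
If you need to do this in more than one place, the steps above could be wrapped in a small helper. This is only a sketch: the function name decode_python2_utf8 is hypothetical, and it assumes the input is either standard UTF-8 or UTF-8 with well-formed surrogate pairs like the data in the question. Anything else still raises UnicodeDecodeError.

```python
def decode_python2_utf8(raw):
    """Decode bytes that may hold surrogate pairs encoded as two 3-byte sequences.

    Hypothetical helper, not part of any library.
    """
    try:
        # Standard UTF-8 needs no special handling.
        return raw.decode('utf_8')
    except UnicodeDecodeError:
        # Let the surrogates through, then merge each pair via a UTF-16 round trip.
        with_surrogates = raw.decode('utf_8', errors='surrogatepass')
        as_utf_16 = with_surrogates.encode('utf_16', errors='surrogatepass')
        return as_utf_16.decode('utf_16')


data = b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'
result = decode_python2_utf8(data)
print(ascii(result))  # ascii() avoids depending on the terminal's encoding
```

Trying the strict decode first means data that is already standard UTF-8 passes through unchanged.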

Thanks! Through this exercise I learned about two aspects of Unicode that I didn't know about before: variation selectors (bytes 4 through 6 of the baseball emoji above are a variation selector) and surrogate pairs. - Jared, Aug 16 '19 at 19:54

Davide Fazio · Answer 2 · 2019-08-16T18:02:09.827

Try re-encoding the decoded string u'BTS\u26be\ufe0f>BTS\U0001f3a4' as UTF-8 in Python 3:

text = u'BTS\u26be\ufe0f>BTS\U0001f3a4'
result = text.encode('utf_8')
print(result)
result.decode('utf_8')

The result contains these bytes:

b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xf0\x9f\x8e\xa4'

These are different from the bytes you have in Python 2:

b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'

But if you decode that result, b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xf0\x9f\x8e\xa4', as UTF-8 in Python 3, you get the string you want.

In short, Python 2 and Python 3 handle this differently, so you should store the re-encoded bytes in the database, since they are unambiguous.
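
One way to guard against the Python 2 style bytes reaching the database is to check them against Python 3's strict decoder before saving. The sketch below is an assumption about how such a check might look; is_strict_utf8 is a hypothetical name, not a library function.

```python
def is_strict_utf8(raw):
    """Return True if raw decodes under Python 3's strict UTF-8 rules.

    Hypothetical helper for illustration only.
    """
    try:
        raw.decode('utf_8')
    except UnicodeDecodeError:
        return False
    return True


python3_bytes = b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xf0\x9f\x8e\xa4'
python2_bytes = b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'

print(is_strict_utf8(python3_bytes))
print(is_strict_utf8(python2_bytes))
```

Only the first value passes, which matches the behaviour described in the question.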
