Some UTF-8 text I'm trying to process contains this lovely 4-byte character: \xF0\x9F\x98\xA5
Per this website, it's "disappointed but relieved face": http://apps.timwhitlock.info/emoji/tables/unicode
It appears to me that Python is treating this as two separate characters.
Here's my test code:
mystring = '\xF0\x9F\x98\xA5'.decode('utf-8')
print len(mystring)
print mystring
print len(mystring.encode('utf-8'))
for c in mystring:
    print c
When I print mystring, I get a lovely face. But when I print the length of mystring, I get 2, where I expected 1.
Incidentally, the reason I'm dealing with this is that I need to handle 4-byte characters in the string before I push it to a pre-5.5 MySQL database (which only handles 3-byte UTF-8).
I would appreciate help understanding why Python appears to treat this as two characters, and also how to detect 4-byte characters in a UTF-8 string.
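To make the detection part concrete, here is a rough sketch of the kind of check I have in mind. It is only an illustration under a few assumptions: it is written for Python 3 (where a str is a sequence of code points), not the Python 2 I'm actually running; it treats "4-byte character" as meaning any code point above U+FFFF; and the helper names find_four_byte_chars and strip_four_byte_chars are made up for this example.

```python
import re

# Assumption: a "4-byte character" is any code point outside the Basic
# Multilingual Plane (U+10000 to U+10FFFF), since those are exactly the
# code points that UTF-8 encodes with four bytes.
_FOUR_BYTE = re.compile('[\U00010000-\U0010FFFF]')


def find_four_byte_chars(text):
    """Return a list of the characters in text that need 4 bytes in UTF-8."""
    return _FOUR_BYTE.findall(text)


def strip_four_byte_chars(text, replacement='\ufffd'):
    """Replace every 4-byte character with a placeholder (hypothetical helper)."""
    return _FOUR_BYTE.sub(replacement, text)


if __name__ == '__main__':
    sample = b'before \xF0\x9F\x98\xA5 after'.decode('utf-8')
    print(find_four_byte_chars(sample))
    print(strip_four_byte_chars(sample))
```

Something along these lines would let me either reject or substitute the offending characters before the insert, but I don't know whether the same pattern behaves correctly on my Python 2 build, given that len() already reports 2 there.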
Thanks.