Why does Python recognise this UTF-8 character as two characters rather than one

Question

Some UTF-8 text I'm trying to process has this lovely 4 byte character: \xF0\x9F\x98\xA5

Per this website, it's "disappointed but relieved face": http://apps.timwhitlock.info/emoji/tables/unicode

It appears to me that Python is treating this as two separate characters.

Here's my test code:

mystring = '\xF0\x9F\x98\xA5'.decode('utf-8')

print len(mystring)

print mystring

print len(mystring.encode('utf-8'))

for c in mystring:
    print c

When I print mystring, I get a lovely face. But when I print the length of mystring I get 2.

Incidentally, the reason I'm trying to deal with this is that I need to address 4 byte characters in the string so I can push to a pre-5.5 MySQL database (which only handles 3 byte UTF-8).

I would appreciate help on why Python appears to recognize this as two characters, and also on how to detect 4 byte characters in a UTF-8 string.

Thanks.

Possible bug? The decoded string in Python 3 has length 1. (`len(b'\xf0\x9f\x98\xa5'.decode('utf-8'))`) — chepner, Sep 09 '15 at 18:16
I wasn't aware of this, but UCS-2 builds use 2 bytes internally for each character, see: http://stackoverflow.com/questions/12636489/python-convert-4-byte-char-to-avoid-mysql-error-incorrect-string-value. I'm guessing that's the problem — user1379351, Sep 09 '15 at 18:24
I came across this issue when performing AngularJS text validation. 4-byte characters are counted as two. — liteflier, Nov 19 '15 at 19:40

score 6 · Accepted Answer · answered Sep 10 '15 at 03:34

You're using a narrow (UCS-2) build of Python, which stores every code point above U+FFFF as a UTF-16 surrogate pair and therefore counts it as two characters. Some other languages (Java, JavaScript) behave like that; you can consider it a bug. Wide builds of Python 2, and every Python from 3.3 onwards, correctly treat this as one character.
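To make the "two characters" concrete, here is a minimal sketch of the arithmetic that turns a code point above U+FFFF into the two UTF-16 code units a narrow build stores. It is written for Python 3, where `ord()` works on the decoded character, and the helper name `to_surrogate_pair` is made up for illustration.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates.

    Hypothetical helper for illustration only; not part of any library.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use a surrogate pair")
    offset = code_point - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


face = b'\xF0\x9F\x98\xA5'.decode('utf-8')
high, low = to_surrogate_pair(ord(face))
print(hex(ord(face)), hex(high), hex(low))  # 0x1f625 0xd83d 0xde25
```

A narrow build keeps those two code units as separate string elements, which is why `len()` reports 2 there.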

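Recognising 4-byte characters is easy: the first byte of the four is always of the form 11110xxx, so it falls in range(0xf0, 0xf8). In valid UTF-8 only 0xf0 to 0xf4 actually occur, because Unicode stops at U+10FFFF. These 4-byte sequences encode exactly the code points above U+FFFF.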
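As a rough sketch of that lead-byte check on the raw, still-encoded bytes (the function name `four_byte_offsets` is invented here, and the input is assumed to be valid UTF-8):

```python
def four_byte_offsets(data):
    """Return the byte offsets at which a 4-byte UTF-8 sequence starts.

    Hypothetical helper; assumes `data` is valid UTF-8.
    """
    # bytearray yields integers when iterated on both Python 2 and Python 3.
    # Continuation bytes look like 10xxxxxx, so they never fall in this range.
    return [i for i, byte in enumerate(bytearray(data)) if 0xF0 <= byte <= 0xF7]


raw = b'abc\xF0\x9F\x98\xA5def'
print(four_byte_offsets(raw))  # [3]
```

Because this works on bytes before decoding, it gives the same answer on narrow and wide builds.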

score 3 · Answer 2 · edited May 23 '17 at 12:15

Based on the comments and answer, here's the code I used to solve my requirement (removing or escaping 4 byte characters in Python that's a UCS-2 build):

import re

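def unicodeescape4bytechars(inputstring):
    # Replace every character above U+FFFF with its escaped form (e.g. \U0001f625).
    try:
        # Wide (UCS-4) build: each such character is a single code point
        pattern = re.compile(u'([\U00010000-\U0010ffff])')
    except re.error:
        # UCS-2 build: the range above is invalid here, so match surrogate pairs
        pattern = re.compile(u'([\uD800-\uDBFF][\uDC00-\uDFFF])')

    for match in re.findall(pattern, inputstring):
        inputstring = inputstring.replace(match, match.encode('unicode_escape'))

    return inputstring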

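def remove4bytechars(inputstring):
    # Drop every character above U+FFFF from the string.
    try:
        # Wide (UCS-4) build: each such character is a single code point
        pattern = re.compile(u'[\U00010000-\U0010ffff]')
    except re.error:
        # UCS-2 build: match the surrogate pair instead
        pattern = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')

    return pattern.sub(u'', inputstring)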

mystring = 'abcdefg\xF0\x9F\x98\xA5 as;dlf\xF0\x9F\x98\x83kj'.decode('utf-8')

print unicodeescape4bytechars(mystring)
print remove4bytechars(mystring)

I also relied on these two sources:

Warning raised by inserting 4-byte unicode to mysql

Python, convert 4-byte char to avoid MySQL error "Incorrect string value:"
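For anyone on Python 3.3 or later, where there is no narrow/wide split, the same idea needs only the first pattern. This is a sketch rather than a drop-in replacement for the code above; the names `NON_BMP`, `remove_non_bmp` and `escape_non_bmp` are invented for this example.

```python
import re

# Any single code point above U+FFFF (the ones UTF-8 encodes in 4 bytes).
NON_BMP = re.compile('[\U00010000-\U0010FFFF]')


def remove_non_bmp(text):
    """Drop every character above U+FFFF."""
    return NON_BMP.sub('', text)


def escape_non_bmp(text):
    """Replace every character above U+FFFF with its \\UXXXXXXXX escape."""
    return NON_BMP.sub(
        lambda m: m.group(0).encode('unicode_escape').decode('ascii'),
        text,
    )


sample = b'abcdefg\xF0\x9F\x98\xA5 as;dlf\xF0\x9F\x98\x83kj'.decode('utf-8')
print(remove_non_bmp(sample))   # abcdefg as;dlfkj
print(escape_non_bmp(sample))   # abcdefg\U0001f625 as;dlf\U0001f603kj
```

Using `sub` with a callable escapes each match in one pass, instead of calling `replace` once per match.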
