Some UTF-8 text I'm trying to process has this lovely 4 byte character: \xF0\x9F\x98\xA5

Per this website, it's "disappointed but relieved face": http://apps.timwhitlock.info/emoji/tables/unicode

It appears to me that Python is treating this as two separate characters.

Here's my test code:

mystring = '\xF0\x9F\x98\xA5'.decode('utf-8')

print len(mystring)

print mystring

print len(mystring.encode('utf-8'))

for c in mystring:
    print c

When I print mystring, I get a lovely face. But when I print the length of mystring I get 2.

Incidentally, the reason I'm trying to deal with this is that I need to address 4 byte characters in the string so I can push to a pre-5.5 MySQL database (which only handles 3 byte UTF-8).

I would appreciate help understanding why Python appears to treat this as two characters, and also how to detect 4-byte characters in a UTF-8 string.

Thanks.

user1379351

2 Answers

You're using a narrow (UCS-2) build of Python, which doesn't properly count characters above U+FFFF: each one is stored as a surrogate pair of two 16-bit code units, so len() reports 2. Some other languages (Java, JavaScript) behave the same way (you can consider that a bug). Wide builds of Python 2, and Python 3.3 and later, correctly treat this as one character.
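To make the "2" concrete, here is a small sketch of how a code point above U+FFFF is split into a UTF-16 surrogate pair, which is what a narrow build stores internally. The function name utf16_code_units is my own, not part of any library, and the snippet should run on both Python 2.7 and Python 3.

```python
def utf16_code_units(code_point):
    """Return the UTF-16 code units for a code point as a list of ints."""
    if code_point < 0x10000:
        # Fits in a single 16-bit unit.
        return [code_point]
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> low surrogate
    return [high, low]

# U+1F625 is the character encoded as \xF0\x9F\x98\xA5 in the question.
units = utf16_code_units(0x1F625)
print([hex(u) for u in units])  # ['0xd83d', '0xde25']
print(len(units))               # 2
```

Those two units are exactly what the for loop in the question iterates over on a narrow build.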

Recognising 4-byte characters in the encoded UTF-8 is easy: the first byte of the four is always of the form 11110xxx (so all values in range(0xf0, 0xf8)), although valid UTF-8 only ever uses 0xF0 through 0xF4. These sequences represent all code points above U+FFFF.
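As a rough illustration of the lead-byte check, the sketch below scans an encoded byte string and reports where 4-byte sequences start. It assumes the input is already valid UTF-8, and four_byte_offsets is just a name I picked for the example. It should work on Python 2.7 and Python 3, since bytearray yields integers on both.

```python
def four_byte_offsets(data):
    """Return the byte offsets at which a 4-byte UTF-8 sequence starts.

    Assumes `data` is valid UTF-8 held in a byte string.
    """
    offsets = []
    for i, byte in enumerate(bytearray(data)):
        if 0xF0 <= byte <= 0xF7:  # lead byte of the form 11110xxx
            offsets.append(i)
    return offsets

sample = b'abc\xF0\x9F\x98\xA5def'
print(four_byte_offsets(sample))  # [3]
```

Continuation bytes always have the form 10xxxxxx, so they can never be mistaken for a lead byte by this test.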

roeland

Based on the comments and answer, here's the code I used to solve my requirement (removing or escaping 4-byte characters in a UCS-2 build of Python):

import re

def unicodeescape4bytechars(inputstring):
    # Replace each character above U+FFFF with its \UXXXXXXXX escape.
    try:
        # Wide (UCS-4) build: one code point per character.
        pattern = re.compile(u'[\U00010000-\U0010ffff]')
    except re.error:
        # Narrow (UCS-2) build: match the surrogate pair instead.
        pattern = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')

    def escape(match):
        return match.group(0).encode('unicode_escape').decode('ascii')

    return pattern.sub(escape, inputstring)

def remove4bytechars(inputstring):
    try:
        pattern = re.compile(u'[\U00010000-\U0010ffff]')
    except re.error:
        # UCS-2 build
        pattern = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')

    return pattern.sub(u'', inputstring)

mystring = 'abcdefg\xF0\x9F\x98\xA5 as;dlf\xF0\x9F\x98\x83kj'.decode('utf-8')

print unicodeescape4bytechars(mystring)
print remove4bytechars(mystring)
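For anyone reading this on Python 3.3 or later, where strings are indexed by code point, the surrogate-pair fallback shouldn't be needed. Here is a minimal sketch of the same two operations under that assumption; the names NON_BMP, remove_non_bmp and escape_non_bmp are mine, and I have not run this against MySQL itself.

```python
import re

# Matches any single code point above U+FFFF (outside the BMP).
NON_BMP = re.compile('[\U00010000-\U0010FFFF]')

def remove_non_bmp(text):
    """Drop every character that needs 4 bytes in UTF-8."""
    return NON_BMP.sub('', text)

def escape_non_bmp(text):
    """Replace every such character with its \\UXXXXXXXX escape."""
    def escape(match):
        return match.group(0).encode('unicode_escape').decode('ascii')
    return NON_BMP.sub(escape, text)

text = b'abcdefg\xF0\x9F\x98\xA5 as;dlf\xF0\x9F\x98\x83kj'.decode('utf-8')
print(remove_non_bmp(text))  # abcdefg as;dlfkj
print(escape_non_bmp(text))  # abcdefg\U0001f625 as;dlf\U0001f603kj
```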

I also relied on these two sources:

Warning raised by inserting 4-byte unicode to mysql

Python, convert 4-byte char to avoid MySQL error "Incorrect string value:"

user1379351