14

Consider the following exchange on IPython:

In [1]: s = u'華袞與緼同歸'

In [2]: len(s)
Out[2]: 8

The correct output should have been 7, but because the fifth of these seven Chinese characters has a high Unicode code-point, it is represented in UTF-8 by a "surrogate pair", rather than just one simple codepoint, and as a result Python thinks it is two characters rather than one.

Even if I use unicodedata, which returns the surrogate pair correctly as a single codepoint (\U00026177), when passed to len() the wrong length is still returned:

In [3]: import unicodedata

In [4]: unicodedata.normalize('NFC', s)
Out[4]: u'\u83ef\u889e\u8207\u7dfc\U00026177\u540c\u6b78'


In [5]: len(unicodedata.normalize('NFC', s))
Out[5]: 8

Without taking drastic steps like recompiling Python for UTF-32, is there a simple way to get the correct length in situations like this?

I'm on IPython 0.13, Python 2.7.2, Mac OS 10.8.2.

brannerchinese
  • The discussions [here](http://stackoverflow.com/questions/9934752/platform-specific-unicode-semantics-in-python-2-7) and [here](http://stackoverflow.com/questions/6922480/how-to-get-a-reliable-unicode-character-count-in-python) seem relevant. – DSM Oct 16 '12 at 03:36
  • @DSM: Thanks for digging these up. Your first link shows Python compiled for UTF-32 ("wide build"), something I ruled out in my question. In the second, the reply by wberry shows an elaborate piece of code to actually count true characters. My default workaround is like the latter, but I am hoping there exists something built in and more direct. – brannerchinese Oct 16 '12 at 04:11
  • I can't reproduce your result here (Ubuntu box, Python 2.7.2). For the unicode string u'\u83ef\u889e\u8207\u7dfc\U00026177\u540c\u6b78' I get a length of seven with both len(s) and len(unicodedata.normalize('NFC', s)) – Vicent Oct 16 '12 at 07:20
  • It's probably highly version-dependent. Python 3.3 should deal more gracefully with this since, by default, it never creates surrogate pairs (even though you can create them by hand). – Bakuriu Oct 16 '12 at 07:40
  • It isn't UTF-8 that represents the non-BMP character by a surrogate pair. It is UTF-16, or rather the hack that Python used in versions < 3.3 on narrow builds. (Well, you _could_ take the surrogate pairs as in UTF-16, and encode each of the two surrogates using UTF-8, but this is explicitly prohibited by RFC 3629 though many UTF-8 implementations do it: it's called [WTF-8](https://en.wikipedia.org/w/index.php?title=UTF-8&oldid=750909440#WTF-8). But the only way a string can get encoded in UTF-8 this way is if it originally came from UTF-16). See chrispy's answer below for a simple solution. – ShreevatsaR Nov 22 '16 at 19:57
  • BTW, even Python 2 on narrow builds does the right thing with UTF-8: for your `s` above, `len(s.encode('utf-8'))` gives 22, which comes from encoding six of the seven characters using 3 bytes each, and the other one using 4 bytes. UTF-8 here is _not_ doing the wrong thing of encoding the surrogate pairs separately (thank god) which would lead to a length of 8*3=24 bytes. – ShreevatsaR Nov 22 '16 at 20:28

3 Answers

8

I think this has been fixed in 3.3. See:

http://docs.python.org/py3k/whatsnew/3.3.html
http://www.python.org/dev/peps/pep-0393/ (search for wstr_length)
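
As a rough illustration of the difference, here is a minimal sketch that assumes Python 3.3 or later. It compares the number of code points with the number of UTF-16 code units, which is what a narrow 2.7 build counts, and with the UTF-8 byte length mentioned in the comments on the question. The variable names are only illustrative.

```python
# Assumes Python 3.3+, where len() counts code points.
s = '\u83ef\u889e\u8207\u7dfc\U00026177\u540c\u6b78'

# Number of code points: what len() reports on Python 3.3+.
code_points = len(s)

# Number of UTF-16 code units: the non-BMP character takes two (a surrogate pair).
utf16_units = len(s.encode('utf-16-le')) // 2

# Number of UTF-8 bytes: six 3-byte characters plus one 4-byte character.
utf8_bytes = len(s.encode('utf-8'))

print(code_points, utf16_units, utf8_bytes)  # 7 8 22
```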

Ecir Hana
  • Yes. But in 2.7 we are apparently on our own, unless we are using a wide build. It will be a while before I can move to Py3, unfortunately. – brannerchinese Oct 24 '12 at 02:16
  • I moved to Py3 in February, and (except when I am forced back into 2.7 by libraries such as NLTK) my troubles with surrogate pairs are over. This is indeed now the best solution. – brannerchinese May 09 '13 at 22:33
7

I made a function to do this on Python 2:

import re

# Matches a UTF-16 surrogate pair: a high surrogate followed by a low surrogate.
SURROGATE_PAIR = re.compile(u'[\ud800-\udbff][\udc00-\udfff]', re.UNICODE)
def unicodeLen(s):
  return len(SURROGATE_PAIR.sub('.', s))

By replacing surrogate pairs with a single character, we 'fix' the len function. On normal strings, this should be pretty efficient: since the pattern won't match, the original string will be returned without modification. It should work on wide (32-bit) Python builds, too, as the surrogate pair encoding will not be used.
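
A short usage sketch follows, with the function repeated so the block runs on its own. The test string spells out the surrogate pair explicitly (\ud858\udd77 is the UTF-16 pair for U+26177), which is an assumption made here so that the example behaves the same way regardless of how the interpreter stores non-BMP characters.

```python
import re

# Matches a high surrogate immediately followed by a low surrogate.
SURROGATE_PAIR = re.compile(u'[\ud800-\udbff][\udc00-\udfff]', re.UNICODE)

def unicodeLen(s):
    # Collapse each surrogate pair to one placeholder character, then count.
    return len(SURROGATE_PAIR.sub(u'.', s))

# The string from the question, with U+26177 written as an explicit surrogate pair.
paired = u'\u83ef\u889e\u8207\u7dfc\ud858\udd77\u540c\u6b78'

# A string with no surrogates: the pattern never matches, so len() is unchanged.
plain = u'\u83ef\u889e\u8207\u7dfc\u540c\u6b78'

print(len(paired), unicodeLen(paired))  # 8 7
print(len(plain), unicodeLen(plain))    # 6 6
```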

Alice Purcell
3

You can override the built-in len function in Python (see: How does len work?) and add an if statement to it that checks for surrogate pairs, so that each pair is counted as one character.
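
One way this idea might look is sketched below. It is not code from this answer: the helper name codepoint_len is hypothetical, and it is written as a separate function rather than a replacement for the built-in len, since shadowing a built-in can surprise other code.

```python
def codepoint_len(s):
    """Count characters, treating a high+low surrogate pair as one character."""
    count = 0
    i = 0
    n = len(s)
    while i < n:
        # The "if" check: a high surrogate followed by a low surrogate is one pair.
        if (u'\ud800' <= s[i] <= u'\udbff' and i + 1 < n
                and u'\udc00' <= s[i + 1] <= u'\udfff'):
            i += 2  # skip both halves of the pair
        else:
            i += 1  # ordinary code unit (or an unpaired surrogate)
        count += 1
    return count

# The question's string with U+26177 written as an explicit surrogate pair.
paired = u'\u83ef\u889e\u8207\u7dfc\ud858\udd77\u540c\u6b78'
print(codepoint_len(paired))  # 7
```

An unpaired surrogate is counted as one character here; whether that is the right behaviour depends on the application.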

schilippe