Python: getting correct string length when it contains surrogate pairs

Question

Consider the following exchange on IPython:

In [1]: s = u'華袞與緼\U00026177同歸'

In [2]: len(s)
Out[2]: 8

The correct output should have been 7, but because the fifth of these seven Chinese characters has a high Unicode code-point, it is represented in UTF-8 by a "surrogate pair", rather than just one simple codepoint, and as a result Python thinks it is two characters rather than one.

Even if I use unicodedata, which returns the surrogate pair correctly as a single codepoint (\U00026177), when passed to len() the wrong length is still returned:

In [3]: import unicodedata

In [4]: unicodedata.normalize('NFC', s)
Out[4]: u'\u83ef\u889e\u8207\u7dfc\U00026177\u540c\u6b78'


In [5]: len(unicodedata.normalize('NFC', s))
Out[5]: 8

Without taking drastic steps like recompiling Python for UTF-32, is there a simple way to get the correct length in situations like this?

I'm on IPython 0.13, Python 2.7.2, Mac OS 10.8.2.

The discussions [here](http://stackoverflow.com/questions/9934752/platform-specific-unicode-semantics-in-python-2-7) and [here](http://stackoverflow.com/questions/6922480/how-to-get-a-reliable-unicode-character-count-in-python) seem relevant. — DSM, Oct 16 '12 at 03:36
@DSM: Thanks for digging these up. Your first link shows Python compiled for UTF-32 ("wide build"), something I ruled out in my question. In the second, the reply by wberry shows an elaborate piece of code to actually count true characters. My default workaround is like the latter, but I am hoping there exists something built in and more direct. — brannerchinese, Oct 16 '12 at 04:11
I can't reproduce your result here (Ubuntu box, Python 2.7.2). For the unicode string u'\u83ef\u889e\u8207\u7dfc\U00026177\u540c\u6b78' I get a length of seven with both len(s) and len(unicodedata.normalize('NFC', s)). – Vicent, Oct 16 '12 at 07:20
It's probably highly version-dependent. Python 3.3 should deal with this more gracefully since, by default, it never creates surrogate pairs (even though you can create them by hand). – Bakuriu, Oct 16 '12 at 07:40
It isn't UTF-8 that represents the non-BMP character by a surrogate pair. It is UTF-16, or rather the hack that Python used in versions < 3.3 on narrow builds. (Well, you _could_ take the surrogate pairs as in UTF-16, and encode each of the two surrogates using UTF-8, but this is explicitly prohibited by RFC 3629 though many UTF-8 implementations do it: it's called [WTF-8](https://en.wikipedia.org/w/index.php?title=UTF-8&oldid=750909440#WTF-8). But the only way a string can get encoded in UTF-8 this way is if it originally came from UTF-16). See chrispy's answer below for a simple solution. — ShreevatsaR, Nov 22 '16 at 19:57
BTW, even Python 2 on narrow builds does the right thing with UTF-8: for your `s` above, `len(s.encode('utf-8'))` gives 22, which comes from encoding six of the seven characters using 3 bytes each, and the other one using 4 bytes. UTF-8 here is _not_ doing the wrong thing of encoding the surrogate pairs separately (thank god) which would lead to a length of 8*3=24 bytes. — ShreevatsaR, Nov 22 '16 at 20:28
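The arithmetic in the comments above can be checked with a short sketch. It assumes Python 3.3 or later, where `len()` counts code points; the helper name `to_surrogate_pair` is made up for illustration and is not part of any library.

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into its UTF-16 (high, low) surrogates."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


# The string from the question, written with escapes.
s = '\u83ef\u889e\u8207\u7dfc\U00026177\u540c\u6b78'

code_points = len(s)                            # what Python 3.3+ reports
utf16_units = len(s.encode('utf-16-le')) // 2   # what a narrow build reports
utf8_bytes = len(s.encode('utf-8'))             # six 3-byte chars plus one 4-byte char

high, low = to_surrogate_pair(0x26177)
print(code_points, utf16_units, utf8_bytes, hex(high), hex(low))
# prints: 7 8 22 0xd858 0xdd77
```

The count of 8 in the question is therefore the number of UTF-16 code units, not a property of UTF-8.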

score 8 · Accepted Answer · answered Oct 20 '12 at 16:10

I think this has been fixed in Python 3.3. See:

http://docs.python.org/py3k/whatsnew/3.3.html
http://www.python.org/dev/peps/pep-0393/ (search for wstr_length)

answered Oct 20 '12 at 16:10

Ecir Hana


Yes. But in 2.7 we are apparently on our own, unless we are using a wide build. It will be a while before I can move to Py3, unfortunately. – brannerchinese Oct 24 '12 at 02:16
I moved to Py3 in February, and (except when I am forced back into 2.7 by libraries such as NLTK) my troubles with surrogate pairs are over. This is indeed now the best solution. – brannerchinese May 09 '13 at 22:33
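To tell which situation an interpreter is in, one option is to inspect `sys.maxunicode`, which is 0xFFFF on a narrow Python 2 build and 0x10FFFF on wide builds and on Python 3.3+. The sketch below is only an illustration; the function name `is_narrow_build` is an assumption, not an existing API.

```python
import sys


def is_narrow_build():
    """True when this interpreter stores non-BMP characters as surrogate pairs."""
    return sys.maxunicode == 0xFFFF


ch = u'\U00026177'  # the non-BMP character from the question

if is_narrow_build():
    # Narrow Python 2 build: len(ch) is 2 (two UTF-16 code units).
    print('narrow build, len(ch) = %d' % len(ch))
else:
    # Wide build or Python 3.3+: len(ch) is 1 (one code point).
    print('wide build or Python 3.3+, len(ch) = %d' % len(ch))
```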

score 7 · Answer 2 · answered Apr 14 '15 at 17:42

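I made a function to do this on Python 2:

import re

# A high surrogate followed by a low surrogate: one non-BMP character on a narrow build.
SURROGATE_PAIR = re.compile(u'[\ud800-\udbff][\udc00-\udfff]', re.UNICODE)
def unicodeLen(s):
    # Collapse each surrogate pair to a single placeholder, then count.
    return len(SURROGATE_PAIR.sub(u'.', s))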

By replacing each surrogate pair with a single character, we 'fix' the len function. On strings without surrogate pairs this should be pretty efficient: since the pattern won't match, the original string is returned without modification. It should work on wide (UCS-4) Python builds too, as the surrogate pair encoding is not used there.
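The same idea can be tried on Python 3 by simulating narrow-build storage. This is a rough sketch under that assumption: `as_narrow_build` is a hypothetical helper that re-expresses a string as one character per UTF-16 code unit, and `unicode_len` is a Python 3 spelling of the function above.

```python
import re
import struct

# A high surrogate followed by a low surrogate.
SURROGATE_PAIR = re.compile('[\ud800-\udbff][\udc00-\udfff]')


def unicode_len(s):
    """Count code points, treating each surrogate pair as one character."""
    return len(SURROGATE_PAIR.sub('.', s))


def as_narrow_build(s):
    """Hypothetical helper: one character per UTF-16 code unit, as on a narrow build."""
    raw = s.encode('utf-16-le')
    units = struct.unpack('<%dH' % (len(raw) // 2), raw)
    return ''.join(chr(u) for u in units)


s = '\u83ef\u889e\u8207\u7dfc\U00026177\u540c\u6b78'
narrow = as_narrow_build(s)

print(len(narrow))          # 8: plain len() counts both surrogates
print(unicode_len(narrow))  # 7: the pair is collapsed before counting
print(unicode_len(s))       # 7: no surrogates present, nothing to replace
```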

answered Apr 14 '15 at 17:42

Alice Purcell


This won't work with 4-byte unicode characters, e.g. 💪🏿 – wojcikstefan Dec 01 '16 at 21:40
@wojcikstefan It should do, why do you say that? The surrogate pair mechanism encodes anything that doesn't fit in a single 16-bit code unit; 💪 (U+1F4AA) is D83D DCAA, for example. – Alice Purcell Dec 05 '16 at 13:42
I would expect a single bicep char (like the one above) to return a length of `1`, but `unicodeLen(u'\U0001f4aa\U0001f3ff')` returns `2`. Is my expectation incorrect @chrispy? – wojcikstefan Mar 04 '17 at 22:45
It doesn't handle emoji modifiers! – Alice Purcell Mar 06 '17 at 13:02
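The exchange above is about the gap between code points and what a reader sees as one character. The sketch below is a rough approximation only, assuming Python 3: it skips the emoji skin-tone modifiers (U+1F3FB to U+1F3FF) and combining marks when counting. It is not full grapheme-cluster segmentation as defined in Unicode UAX #29, so ZWJ sequences, flags, and similar cases are not handled. The name `approx_visible_len` is made up for illustration.

```python
import unicodedata

# Emoji skin-tone modifiers, U+1F3FB through U+1F3FF.
EMOJI_MODIFIERS = range(0x1F3FB, 0x1F400)


def approx_visible_len(s):
    """Approximate count of user-perceived characters (not full UAX #29)."""
    count = 0
    for ch in s:
        if ord(ch) in EMOJI_MODIFIERS:
            continue  # modifier attaches to the preceding emoji
        if unicodedata.combining(ch):
            continue  # combining mark attaches to the preceding base character
        count += 1
    return count


print(approx_visible_len('\U0001f4aa\U0001f3ff'))  # 1: bicep plus skin-tone modifier
print(approx_visible_len('e\u0301'))               # 1: e plus combining acute accent
print(len('\U0001f4aa\U0001f3ff'))                 # 2: two code points
```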

score 3 · Answer 3 · edited May 23 '17 at 12:10

You can shadow the built-in len function in Python (see: How does len work?) and add an if statement to it that checks for surrogate pairs in unicode strings.
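One way to read this suggestion is a module-level `len` that defers to the built-in and subtracts one for each surrogate pair. The sketch below assumes Python 3, where the module is called `builtins` (on Python 2 it is `__builtin__`, and the type check would be against `unicode`). It only affects code in the module that defines it.

```python
import builtins
import re

# A high surrogate followed by a low surrogate.
_SURROGATE_PAIR = re.compile('[\ud800-\udbff][\udc00-\udfff]')


def len(obj):
    """Module-local len() that counts each surrogate pair as one character."""
    n = builtins.len(obj)
    if isinstance(obj, str):
        # Each pair was counted twice by the built-in; subtract one per pair.
        n -= builtins.len(_SURROGATE_PAIR.findall(obj))
    return n


print(len('\ud858\udd77'))  # 1: one surrogate pair
print(len([1, 2, 3]))       # 3: non-string objects are unaffected
```

Shadowing a built-in can surprise other readers of the code, so a separately named function is often the less risky choice.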

answered May 08 '13 at 22:16

schilippe

