
In Python 2.7:

In [2]: utf8_str = '\xf0\x9f\x91\x8d'
In [3]: print(utf8_str)
👍
In [4]: unicode_str = utf8_str.decode('utf-8')
In [5]: print(unicode_str)
👍
In [6]: unicode_str
Out[6]: u'\U0001f44d'
In [7]: len(unicode_str)
Out[7]: 2

Since unicode_str contains only a single Unicode code point (U+1F44D), why does len(unicode_str) return 2 instead of 1?

Martijn Pieters
Tom

1 Answer


Your Python binary was compiled with UCS-2 support (a narrow build), so internally any code point outside of the BMP (Basic Multilingual Plane) is represented using a surrogate pair, that is, two 16-bit code units.

That means such codepoints show up as 2 characters when asking for the length.
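
To see where the two units come from, here is a minimal sketch of the UTF-16 surrogate arithmetic done by hand. The helper name `to_surrogate_pair` is made up for illustration and is not part of any Python API.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP have a surrogate pair")
    offset = code_point - 0x10000       # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits -> high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits -> low (trail) surrogate
    return high, low

high, low = to_surrogate_pair(0x1F44D)
print("%04x %04x" % (high, low))  # d83d dc4d
```

Those are the same two values a narrow build reports when you iterate over the string, which is why `len()` counts 2 there.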

If this matters, you'll have to recompile your Python binary to use UCS-4 instead (./configure --enable-unicode=ucs4 will enable it), or upgrade to Python 3.3 or newer. In 3.3, Python's Unicode support was overhauled to use a variable-width Unicode type that switches between ASCII, UCS-2 and UCS-4 as required by the codepoints contained.

On Python versions 2.7 and 3.0 - 3.2, you can detect what kind of build you have by inspecting the sys.maxunicode value; it'll be 2^16-1 == 65535 == 0xFFFF for a narrow UCS-2 build, 1114111 == 0x10FFFF for a wide UCS-4 build. In Python 3.3 and up it is always set to 1114111.

Demo:

# Narrow build
$ bin/python -c 'import sys; print sys.maxunicode, len(u"\U0001f44d"), list(u"\U0001f44d")'
65535 2 [u'\ud83d', u'\udc4d']
# Wide build
$ python -c 'import sys; print sys.maxunicode, len(u"\U0001f44d"), list(u"\U0001f44d")'
1114111 1 [u'\U0001f44d']
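
If you can neither recompile nor upgrade, one possible workaround is to count code points yourself. The following is only a sketch: `NARROW_BUILD` and `count_code_points` are hypothetical names, and it assumes the string contains only well-formed surrogate pairs (a stray low surrogate would go uncounted on a narrow build).

```python
import sys

# True only on narrow builds (possible on Python 2.7 and 3.0 - 3.2).
NARROW_BUILD = sys.maxunicode < 0x10FFFF

def count_code_points(text):
    """Count code points, treating each surrogate pair as one."""
    if not NARROW_BUILD:
        # Wide builds and Python 3.3+ already count code points.
        return len(text)
    # Skip low (trail) surrogates so that each pair is counted once,
    # via its high surrogate.
    return sum(1 for unit in text if not 0xDC00 <= ord(unit) <= 0xDFFF)

print(count_code_points(u"\U0001f44d"))  # 1
```

The `u"..."` prefix keeps the snippet valid on Python 2.7 as well as on Python 3.3 and newer.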
Martijn Pieters
  • You can use `sys.maxunicode` on Python 3 too. It is implied, but it is worth pointing out explicitly that `len(u'\U0001f44d') == 1` on Python 3.3+ (or a wide Python 2 build) – jfs Feb 19 '16 at 15:32
  • @J.F.Sebastian: sure, but as of 3.3 it is a constant there, as Python 3.3 and up transparently switch between ASCII, UCS-2 and UCS-4 storage for strings as required. And you really don't want to use Python < 3.3 anyway. – Martijn Pieters Feb 19 '16 at 15:35
  • There is no narrow/wide distinction on Python 3.3+ (the internal representation is not exposed -- you don't care what Python uses internally). The point is that you could use `sys.maxunicode` regardless of the version. – jfs Feb 19 '16 at 15:42
  • I never said there was such a distinction. – Martijn Pieters Feb 19 '16 at 16:00
  • Yes, that is why `narrow_mode = (sys.maxunicode < 0x10ffff)` could be used on any version (both Python 2 and 3). – jfs Feb 19 '16 at 22:41
  • My system is running Python 3.6 and I double-checked that the `sys.maxunicode` value is `1114111`, but the length of this emoji/string is still displayed as 2 :_( – ankit Oct 30 '22 at 10:28
  • @ankit: I can't reproduce that: `python3.6 -c 'import sys; print(sys.version_info, sys.maxunicode, len("\U0001f44d"), list("\U0001f44d"))'` outputs `sys.version_info(major=3, minor=6, micro=15, releaselevel='final', serial=0) 1114111 1 ['👍']` – Martijn Pieters Nov 25 '22 at 16:47